Keywords: bioinformatics, computational biology, recurrent neural networks, genomics
Oxford Nanopore Technologies (ONT) sequencing allows the identification of RNA and DNA sequences. When subjected to an electric field, pore-forming proteins allow polynucleotides to translocate between two compartments filled with an electrolyte solution. Each nucleotide passing through the pore produces a distinct, measurable change in the ionic current, so the controlled threading of a DNA or RNA strand through the pore allows each nucleotide to be identified in sequence. This recent technology has already had a significant impact, enabling the sequencing of long DNA molecules as well as long polyadenylated RNA.
Recently, it has been shown that this technology, combined with novel machine learning algorithms, can distinguish modified from unmodified nucleotides, since each modification produces a characteristic change in the ionic current. This makes ONT a promising technology for the study of epigenetics, epitranscriptomics and tRNA regulation.
In this semester project, the students will apply and further develop the machine learning algorithms needed for base calling (translating ionic current traces into the RNA alphabet), such as recurrent neural networks (RNNs) with long short-term memory (LSTM) units, to identify long nucleotide sequences and their modifications. Training and test sets from a recent dataset produced by our lab will be provided. The main challenge will be to find, implement and improve the best state-of-the-art network architecture. The students will be free to use the library and language of their choice, although an implementation in PyTorch would be preferred.
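To give a rough idea of what such a base caller can look like, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the architecture the project will settle on: the class name BaseCaller, the layer sizes, the convolutional front end and the use of a CTC (connectionist temporal classification) output with greedy decoding are all choices made for this example. The alphabet is limited to the four canonical RNA bases plus a blank symbol; modified nucleotides could plausibly be handled by adding extra symbols, but that too is an assumption.

```python
import torch
import torch.nn as nn

# Hypothetical alphabet: index 0 is the CTC blank, 1-4 are the canonical RNA bases.
# Extra symbols for modified nucleotides could be appended (assumption).
ALPHABET = "-ACGU"


class BaseCaller(nn.Module):
    """Toy bidirectional LSTM mapping a raw signal trace to per-step base scores."""

    def __init__(self, hidden_size=64, num_layers=2, num_classes=len(ALPHABET)):
        super().__init__()
        # A strided 1D convolution shortens the signal before the recurrent layers.
        self.conv = nn.Conv1d(1, hidden_size, kernel_size=9, stride=3, padding=4)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, signal):
        # signal: (batch, samples) -> (batch, 1, samples) for the convolution
        x = torch.relu(self.conv(signal.unsqueeze(1)))
        # (batch, channels, steps) -> (batch, steps, channels) for the LSTM
        x, _ = self.lstm(x.transpose(1, 2))
        # Log-probabilities over the alphabet at every time step
        return self.head(x).log_softmax(dim=-1)


def ctc_loss(model, signal, targets):
    """CTC loss for a batch of equal-length signals and equal-length targets."""
    log_probs = model(signal)                      # (batch, steps, classes)
    batch, steps, _ = log_probs.shape
    input_lengths = torch.full((batch,), steps, dtype=torch.long)
    target_lengths = torch.full((batch,), targets.size(1), dtype=torch.long)
    # nn.CTCLoss expects log-probabilities shaped (steps, batch, classes).
    criterion = nn.CTCLoss(blank=0)
    return criterion(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)


def greedy_decode(log_probs):
    """Greedy CTC decoding: collapse repeated labels, then drop blanks."""
    sequences = []
    for path in log_probs.argmax(dim=-1).tolist():
        bases, previous = [], 0
        for label in path:
            if label != previous and label != 0:
                bases.append(ALPHABET[label])
            previous = label
        sequences.append("".join(bases))
    return sequences


if __name__ == "__main__":
    torch.manual_seed(0)
    model = BaseCaller()
    signal = torch.randn(4, 300)              # synthetic stand-in for real traces
    targets = torch.randint(1, 5, (4, 20))    # random base labels, for shape checking only
    loss = ctc_loss(model, signal, targets)
    loss.backward()                           # gradients flow through conv, LSTM and head
    print(greedy_decode(model(signal)))
```

The data here is random and serves only to check shapes; with the provided training set, the signal tensor would hold normalized current traces and the targets would hold the reference sequences. CTC is used in this sketch because it does not require an alignment between signal samples and bases, which is one common way to train sequence models of this kind.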
Contact: [email protected]