NanoSplicer (Yupei You et al., 2022) is a program that accurately identifies splice junctions using Oxford Nanopore Sequencing data. It performs well on cDNA data but poorly on direct RNA (dRNA) data. This project identifies the failure modes of NanoSplicer on dRNA, and how often each occurs, by classifying NanoSplicer alignments. Future research will examine methods to tackle these errors.
Splicing is a process that occurs very frequently in eukaryotic (including human!) cells. It involves joining different ‘exons’ (coding parts of a gene) to form multiple possible strings of messenger RNA. This means many different proteins can be formed by the same gene!
Sampling some cells and examining which of these different ‘isoforms’ are present, and in what quantity, is useful for purposes such as understanding disease or studying new organisms. However, this is hard to do with the most commonly used sequencing type, ‘short-read sequencing’. Short reads provide less information about the order of exons: they are like very small puzzle pieces. Long-read sequencing techniques, like Oxford Nanopore Sequencing, provide more information but have higher error rates.
NanoSplicer (Yupei You et al., 2022) is an algorithm that addresses this issue. It uses the raw electrical signal produced by Oxford Nanopore Sequencing and a mixture model to choose the optimal candidate for each Junction Within Read (a read that maps to multiple non-adjacent exons). NanoSplicer works well on complementary DNA (cDNA, produced by reverse-transcribing RNA into DNA) but struggles on direct RNA. Direct RNA sequencing is useful because it requires less lab work: less chemistry and less preparation time.
However, NanoSplicer performs poorly on direct RNA data, and this is what my project investigates. When NanoSplicer is run on direct RNA data, of 498 classified reads, the baseline (control) mapping software, minimap2, outperforms or equals it in all but 5 instances. Minimap2 finds the ground truth in 380 cases (about 76%), whereas NanoSplicer, which considers the minimap2 candidate, succeeds in only 196 cases (about 39%).
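The counts above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the project's pipeline. It assumes that the 5 exceptional instances are reads where NanoSplicer found the ground truth and minimap2 did not; all variable names are illustrative.

```python
# Counts reported in the text.
total_reads = 498
minimap2_correct = 380
nanosplicer_correct = 196
nanosplicer_only = 5  # assumption: NanoSplicer right, minimap2 wrong

# Derived cells of the 2x2 agreement table.
both_correct = nanosplicer_correct - nanosplicer_only
minimap2_only = minimap2_correct - both_correct
neither_correct = total_reads - both_correct - minimap2_only - nanosplicer_only

minimap2_rate = minimap2_correct / total_reads
nanosplicer_rate = nanosplicer_correct / total_reads

print(f"both correct:     {both_correct}")
print(f"minimap2 only:    {minimap2_only}")
print(f"NanoSplicer only: {nanosplicer_only}")
print(f"neither correct:  {neither_correct}")
print(f"minimap2 success rate:    {minimap2_rate:.1%}")
print(f"NanoSplicer success rate: {nanosplicer_rate:.1%}")
```

Under that assumption, the number of reads where only minimap2 found the ground truth comes out at 189, which agrees with the number of reads assessed visually in the next paragraph.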
Further, in my project, I visually assessed the 189 reads where minimap2 found the ground truth but NanoSplicer failed. Of these, only 49 of the NanoSplicer choices were visually bad candidates, while 114 looked good but didn’t match the ground truth. This could indicate a number of error sources. Tombo, the tool used to ‘re-squiggle’ RNA reads (match them back to their original raw electrical signal), could be failing. Alternatively, the Dynamic Time Warping algorithm used to align squiggles (controlling for dwell time) may be failing because dwell times vary more in direct RNA.
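To illustrate how Dynamic Time Warping controls for dwell time, here is a minimal textbook sketch in plain Python. It is not NanoSplicer's implementation: the function name, the absolute-difference cost and the toy signal levels are all assumptions chosen for illustration.

```python
import math


def dtw_cost(query, reference):
    """Classic DTW cost between two 1-D signals.

    Uses absolute difference as the local cost and allows each sample to be
    matched to one or more samples of the other signal.
    """
    n, m = len(query), len(reference)
    # cost[i][j] = best cost of aligning query[:i] with reference[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # query sample repeats a reference level
                cost[i][j - 1],      # reference level repeats a query sample
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# Toy example: the same three current levels, but with uneven dwell times.
expected = [1.0, 3.0, 2.0]
stretched = [1.0, 1.0, 1.0, 3.0, 2.0, 2.0]
other = [2.0, 1.0, 3.0]  # a different candidate with other levels

d_same = dtw_cost(stretched, expected)
d_other = dtw_cost(stretched, other)
print(d_same, d_other)
```

In this idealised example the stretched signal aligns to the matching candidate at zero cost, because warping absorbs the extra dwell time. Real squiggles are noisy, so costs are never exactly zero, and large dwell-time variation gives the warping path more room to fit a wrong candidate well.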
Overall, further research is needed. Tombo’s performance on direct RNA should be investigated further, and by-nucleotide segmentation should be applied to the squiggles to minimise the impact of dwell time on the Dynamic Time Warping algorithm.
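One way to picture by-nucleotide segmentation is to collapse each segment of the squiggle to a single representative level, so that a long dwell contributes no more samples than a short one. The sketch below assumes the segment boundaries are already known (for example, from a re-squiggle step). The function name, the use of the median and the toy data are all hypothetical.

```python
from statistics import median


def collapse_segments(signal, boundaries):
    """Return one median level per segment.

    `boundaries` holds the start index of each segment followed by the end
    index of the last one, so k segments need k + 1 boundary values.
    """
    return [
        median(signal[start:end])
        for start, end in zip(boundaries[:-1], boundaries[1:])
    ]


# Toy squiggle: three segments with dwell times of 3, 2 and 4 samples.
signal = [1.0, 1.1, 0.9, 3.0, 3.2, 2.0, 2.1, 1.9, 2.0]
boundaries = [0, 3, 5, 9]

levels = collapse_segments(signal, boundaries)
print(levels)
```

Each segment is reduced to one value regardless of how long it lasted, so an alignment run on the collapsed levels would no longer depend on dwell time.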
Patrick Grave
The University of Melbourne