De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

Kristoffer Sahlin, Paul Medvedev

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

3 Scopus citations

Abstract

Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in overall accuracy, in scalability to large datasets, or both. Our tool is available at https://github.com/ksahlin/isONclust.
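
The abstract states only that the method is greedy and uses quality values; it does not spell out the procedure. As a rough illustration of that general idea, the Python sketch below orders reads by a quality-derived score and then assigns each read to the first-best existing cluster representative, or opens a new cluster. Everything here is an assumption for illustration and is not taken from isONclust: the function names (`expected_correct_bases`, `kmers`, `greedy_cluster`), the k-mer overlap criterion, and the parameters `k` and `min_shared` are all hypothetical.

```python
from typing import List, Set, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base Phred quality values)


def expected_correct_bases(quals: List[int]) -> float:
    """Sum of per-base probabilities of being correct, derived from Phred scores."""
    return sum(1.0 - 10 ** (-q / 10.0) for q in quals)


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(reads: List[Read], k: int = 5,
                   min_shared: float = 0.5) -> List[List[int]]:
    """Generic greedy clustering (hypothetical sketch, not the isONclust code).

    Reads are visited once, highest quality score first, so that reliable
    reads tend to become cluster representatives.
    """
    order = sorted(range(len(reads)),
                   key=lambda i: expected_correct_bases(reads[i][1]),
                   reverse=True)
    reps: List[Set[str]] = []       # k-mer set of each cluster's representative
    clusters: List[List[int]] = []  # read indices per cluster
    for i in order:
        read_kmers = kmers(reads[i][0], k)
        best, best_frac = None, 0.0
        for c, rep in enumerate(reps):
            # Fraction of the read's k-mers also found in the representative.
            frac = len(read_kmers & rep) / len(read_kmers) if read_kmers else 0.0
            if frac > best_frac:
                best, best_frac = c, frac
        if best is not None and best_frac >= min_shared:
            clusters[best].append(i)
        else:
            reps.append(read_kmers)
            clusters.append([i])
    return clusters


# Toy input: reads 0 and 1 differ by one base; read 2 is unrelated.
example_reads = [
    ("ACGTACGTAGCTAGCTAGGA", [30] * 20),
    ("ACGTACGTAGCTAGCTAGGT", [10] * 20),
    ("TTTTTGGGGGCCCCCAAAAA", [20] * 20),
]
example_clusters = greedy_cluster(example_reads)
print(example_clusters)  # [[0, 1], [2]]
```

In this toy run the high-quality read 0 is processed first and becomes a representative, and the lower-quality read 1 joins its cluster. Each read is compared only against representatives rather than against every other read, which is what lets a greedy scheme of this kind scale.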

Original language: English (US)
Title of host publication: Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Proceedings
Editors: Lenore J. Cowen
Publisher: Springer Verlag
Pages: 227-242
Number of pages: 16
ISBN (Print): 9783030170820
DOI: https://doi.org/10.1007/978-3-030-17083-7_14
State: Published - Jan 1 2019
Event: 23rd International Conference on Research in Computational Molecular Biology, RECOMB 2019 - Washington, United States
Duration: May 5 2019 to May 8 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11467 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 23rd International Conference on Research in Computational Molecular Biology, RECOMB 2019
Country: United States
City: Washington
Period: 5/5/19 to 5/8/19

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

  • Cite this

    Sahlin, K., & Medvedev, P. (2019). De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. In L. J. Cowen (Ed.), Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Proceedings (pp. 227-242). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11467 LNBI). Springer Verlag. https://doi.org/10.1007/978-3-030-17083-7_14