De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm

Kristoffer Sahlin, Paul Medvedev

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)


Abstract

Long-read sequencing of transcripts with PacBio Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (in order to scale) and makes use of quality values (in order to handle variable error rates). We test isONclust on three simulated and five biological datasets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches in overall accuracy, in scalability to large datasets, or in both. Our tool is available at https://github.com/ksahlin/isONclust.
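
The abstract names two ideas: a greedy pass over the reads and the use of quality values to cope with variable error rates. The following is a minimal illustrative sketch of how those two ideas can fit together, not the isONclust implementation. Everything in it is an assumption made for illustration: the function names (`expected_error_rate`, `kmers`, `greedy_cluster`), the Phred+33 quality encoding, the use of shared k-mers as the similarity measure, and the parameters `k` and `min_shared`.

```python
from typing import List, Set, Tuple


def expected_error_rate(quals: str, offset: int = 33) -> float:
    """Mean per-base error probability of a Phred-encoded quality string.

    Assumes Phred+33 encoding (hypothetical choice for this sketch).
    """
    if not quals:
        return 1.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quals]
    return sum(probs) / len(probs)


def kmers(seq: str, k: int) -> Set[str]:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cluster(
    reads: List[Tuple[str, str, str]],  # (name, sequence, quality string)
    k: int = 5,
    min_shared: float = 0.5,
) -> List[List[str]]:
    """Greedy clustering: each read is visited once and never reassigned.

    Reads are visited from lowest to highest expected error rate, so the
    first read of every cluster (its representative) is a reliable one.
    """
    ordered = sorted(reads, key=lambda r: expected_error_rate(r[2]))
    clusters: List[Tuple[Set[str], List[str]]] = []
    for name, seq, _quals in ordered:
        read_kmers = kmers(seq, k)
        for rep_kmers, members in clusters:
            # Join the first cluster whose representative shares enough k-mers.
            if read_kmers and len(read_kmers & rep_kmers) / len(read_kmers) >= min_shared:
                members.append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((read_kmers, [name]))
    return [members for _, members in clusters]
```

In this sketch the quality values only determine the processing order; a single pass with no reassignment is what makes the procedure greedy and keeps the number of comparisons bounded by reads times clusters.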

Original language: English (US)
Title of host publication: Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Proceedings
Editors: Lenore J. Cowen
Publisher: Springer Verlag
Pages: 227-242
Number of pages: 16
ISBN (Print): 9783030170820
DOIs: https://doi.org/10.1007/978-3-030-17083-7_14
State: Published - Jan 1 2019
Event: 23rd International Conference on Research in Computational Molecular Biology, RECOMB 2019 - Washington, United States
Duration: May 5 2019 to May 8 2019

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11467 LNBI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 23rd International Conference on Research in Computational Molecular Biology, RECOMB 2019
Country: United States
City: Washington
Period: 5/5/19 to 5/8/19

Fingerprint

Clustering
Nanopore
Nanopores
Breadth
Reconstruction Algorithm
Clustering algorithms
Large Data Sets
Sequencing
Clustering Algorithm
Error Rate
Scalability
Genes
Gene
Demonstrate
Family

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Sahlin, K., & Medvedev, P. (2019). De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. In L. J. Cowen (Ed.), Research in Computational Molecular Biology - 23rd Annual International Conference, RECOMB 2019, Proceedings (pp. 227-242). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11467 LNBI). Springer Verlag. https://doi.org/10.1007/978-3-030-17083-7_14