AllSome Sequence Bloom Trees

Chen Sun, Robert S. Harris, Rayan Chikhi, Paul Medvedev

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this article, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39%-85%, with a price of up to 3× memory consumption during queries. Notably, it can query a batch of 198,074 queries in <8 hours (compared with around 2 days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in <11 minutes.
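
The abstract does not spell out how a Sequence Bloom Tree answers a query. As a rough illustration only, the sketch below assumes the commonly described scheme: each leaf holds a Bloom filter of one experiment's k-mers, each internal node holds the bitwise union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction theta of the query k-mers. It is not the authors' implementation and does not model the AllSome variant. The class and function names, the filter size, the hashing scheme, the toy sequences, and the theta values are all hypothetical choices made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class BloomFilter:
    """Toy Bloom filter that uses a Python int as its bit vector."""

    def __init__(self, size: int = 4096, num_hashes: int = 2, bits: int = 0):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        return BloomFilter(self.size, self.num_hashes, self.bits | other.bits)


@dataclass
class Node:
    bloom: BloomFilter
    name: Optional[str] = None  # set only for leaves (one per experiment)
    children: List["Node"] = field(default_factory=list)


def leaf(name: str, seq: str, k: int) -> Node:
    """Build a leaf whose filter holds the k-mers of one experiment."""
    bf = BloomFilter()
    for km in kmers(seq, k):
        bf.add(km)
    return Node(bf, name)


def parent(left: Node, right: Node) -> Node:
    """Build an internal node whose filter is the union of its children."""
    return Node(left.bloom.union(right.bloom), children=[left, right])


def query(node: Node, query_kmers: Set[str], theta: float) -> List[str]:
    """Return names of leaves whose filter holds >= theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no leaf below this node can reach the threshold
    if not node.children:
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(query(child, query_kmers, theta))
    return found


# Toy data: two "experiments" and a query drawn from the first one.
K = 4
root = parent(leaf("exp_A", "ACGTACGTGGA", K), leaf("exp_B", "TTTTCCCCGGGG", K))
result = query(root, kmers("ACGTACGT", K), theta=0.9)
print(result)
```

Because a Bloom filter has no false negatives, `exp_A` is always reported here; `exp_B` would appear only through false positives, which is why reported samples are described as having the transcript "potentially" expressed.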

Original language: English (US)
Pages (from-to): 467-479
Number of pages: 13
Journal: Journal of Computational Biology
Volume: 25
Issue number: 5
DOI: 10.1089/cmb.2017.0258
State: Published - May 2018

Fingerprint

Data structures
Databases
Query
RNA
Sequencing
Data Structures
Experiments
Data storage equipment
Tree Structure
Indexing
Batch
Experiment
Archives

All Science Journal Classification (ASJC) codes

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Cite this

Sun, Chen; Harris, Robert S.; Chikhi, Rayan; Medvedev, Paul. AllSome Sequence Bloom Trees. In: Journal of Computational Biology. 2018; Vol. 25, No. 5. pp. 467-479.
@article{e05837ce351a44e4bc1c69e12a9f3e5a,
title = "AllSome Sequence Bloom Trees",
abstract = "The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this article, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7{\%} and query time by 39{\%}-85{\%}, with a price of up to 3× memory consumption during queries. Notably, it can query a batch of 198,074 queries in <8 hours (compared with around 2 days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in <11 minutes.",
author = "Chen Sun and Harris, {Robert S.} and Rayan Chikhi and Paul Medvedev",
year = "2018",
month = "5",
doi = "10.1089/cmb.2017.0258",
language = "English (US)",
volume = "25",
pages = "467--479",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "5",
}

Sun, C, Harris, RS, Chikhi, R & Medvedev, P 2018, 'AllSome Sequence Bloom Trees', Journal of Computational Biology, vol. 25, no. 5, pp. 467-479. https://doi.org/10.1089/cmb.2017.0258

TY  - JOUR
T1  - AllSome Sequence Bloom Trees
AU  - Sun, Chen
AU  - Harris, Robert S.
AU  - Chikhi, Rayan
AU  - Medvedev, Paul
PY  - 2018/5
Y1  - 2018/5
AB  - The ubiquity of next-generation sequencing has transformed the size and nature of many databases, pushing the boundaries of current indexing and searching methods. One particular example is a database of 2652 human RNA-seq experiments uploaded to the Sequence Read Archive (SRA). Recently, Solomon and Kingsford proposed the Sequence Bloom Tree data structure and demonstrated how it can be used to accurately identify SRA samples that have a transcript of interest potentially expressed. In this article, we propose an improvement called the AllSome Sequence Bloom Tree. Results show that our new data structure significantly improves performance, reducing the tree construction time by 52.7% and query time by 39%-85%, with a price of up to 3× memory consumption during queries. Notably, it can query a batch of 198,074 queries in <8 hours (compared with around 2 days previously) and a whole set of k-mers from a sequencing experiment (about 27 million k-mers) in <11 minutes.
UR  - http://www.scopus.com/inward/record.url?scp=85046906307&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85046906307&partnerID=8YFLogxK
U2  - 10.1089/cmb.2017.0258
DO  - 10.1089/cmb.2017.0258
M3  - Article
C2  - 29620920
AN  - SCOPUS:85046906307
VL  - 25
SP  - 467
EP  - 479
JO  - Journal of Computational Biology
JF  - Journal of Computational Biology
SN  - 1066-5277
IS  - 5
ER  -