TY - JOUR
T1 - Accurate assembly of multi-end RNA-seq data with Scallop2
AU - Zhang, Qimin
AU - Shi, Qian
AU - Shao, Mingfu
N1 - Funding Information:
This work is partly supported by the US National Science Foundation (DBI-2019797 to M.S.), the US National Institutes of Health (R01HG011065 to M.S.) and the Charles K. Etner Career Development Professorship awarded to M.S. by The Pennsylvania State University. Initial algorithmic exploration of Scallop2 advancements was conducted with C. Kingsford (Computational Biology Department, School of Computer Science, Carnegie Mellon University) and was supported by the Gordon and Betty Moore Foundation (GMBF 4554 to C. Kingsford) and the US National Institutes of Health (R01GM122935 to C. Kingsford).
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Nature America, Inc.
PY - 2022/3
Y1 - 2022/3
N2 - Modern RNA-sequencing (RNA-seq) protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers (StringTie2 and Scallop).
AB - Modern RNA-sequencing (RNA-seq) protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core of Scallop2 consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of a splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers (StringTie2 and Scallop).
UR - http://www.scopus.com/inward/record.url?scp=85127286565&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127286565&partnerID=8YFLogxK
U2 - 10.1038/s43588-022-00216-1
DO - 10.1038/s43588-022-00216-1
M3 - Article
C2 - 36713932
AN - SCOPUS:85127286565
SN - 2662-8457
VL - 2
SP - 148
EP - 152
JO - Nature Computational Science
JF - Nature Computational Science
IS - 3
ER -