GeneScissors

A comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

Zhaojun Zhang, Shunping Huang, Jack Wang, Xiang Zhang, Fernando Pardo Manuel De Villena, Leonard Mcmillan, Wei Wang

Research output: Contribution to journal (Article)

9 Citations (Scopus)

Abstract

Motivation: RNA-seq techniques provide an unparalleled means for exploring a transcriptome with deep coverage and base-pair-level resolution. Various analysis tools have been developed to align and assemble RNA-seq data, such as the widely used TopHat/Cufflinks pipeline. A common observation is that a sizable fraction of the fragments/reads align to multiple locations of the genome. These multiple alignments pose substantial challenges to existing RNA-seq analysis tools. Inappropriate treatment may result in reporting spuriously expressed genes (false positives) and missing genuinely expressed genes (false negatives). Such errors affect subsequent analyses, such as differential expression analysis. In our study, we observe that ∼3.5% of transcripts reported by the TopHat/Cufflinks pipeline correspond to annotated nonfunctional pseudogenes. Moreover, ∼10.0% of reported transcripts are not annotated in the Ensembl database. These genes could be either novel expressed genes or false discoveries.

Results: We examine the underlying genomic features that lead to multiple alignments and investigate how they generate systematic errors in RNA-seq analysis. We develop a general tool, GeneScissors, which exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods. In our simulated study, GeneScissors can predict spurious transcriptome calls owing to misalignment with an accuracy close to 90%. It provides substantial improvement over the widely used TopHat/Cufflinks or MapSplice/Cufflinks pipelines in both precision and F-measure. On real data, GeneScissors reports 53.6% fewer pseudogenes and 0.97% more expressed and annotated transcripts than the TopHat/Cufflinks pipeline. In addition, among the 10.0% of transcripts reported by TopHat/Cufflinks that are unannotated, GeneScissors finds that >16.3% are false positives.
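The abstract compares pipelines by precision and F-measure. As a minimal sketch of how those two quantities are computed for a set of reported transcripts, the Python below uses the standard definitions. The function name, the transcript identifiers, and the two reported sets are hypothetical and invented purely for illustration; they are not taken from the paper or from GeneScissors itself.

```python
def precision_recall_f(predicted, truth):
    """Return (precision, recall, F-measure) for predicted vs. true items."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical identifiers (assumption): t* are truly expressed transcripts,
# p* stand in for spurious calls such as pseudogenes hit by misaligned reads.
truly_expressed = {"t1", "t2", "t3", "t4"}
reported = {"t1", "t2", "t3", "p1", "p2"}  # unfiltered pipeline output
filtered = {"t1", "t2", "t3", "p1"}        # after removing one spurious call

before = precision_recall_f(reported, truly_expressed)
after = precision_recall_f(filtered, truly_expressed)
print("before filtering:", before)
print("after filtering: ", after)
```

In this toy case, removing a single spurious call raises precision while recall stays the same, so the F-measure rises as well. That is the kind of improvement the two metrics are meant to capture.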

Original language: English (US)
Journal: Bioinformatics
Volume: 29
Issue number: 13
DOI: 10.1093/bioinformatics/btt216
State: Published - Jul 1 2013

Fingerprint

Misalignment
RNA
Transcriptome
Genes
Pipelines
Pseudogenes
Gene
False Positive
Alignment
Systematic errors
Base Pairing
Systematic Error
Differential Expression
Learning systems
Genomics
Observation
Genome
Databases
Machine Learning
Fragment

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Zhang, Zhaojun ; Huang, Shunping ; Wang, Jack ; Zhang, Xiang ; Pardo Manuel De Villena, Fernando ; Mcmillan, Leonard ; Wang, Wei. / GeneScissors : A comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment. In: Bioinformatics. 2013 ; Vol. 29, No. 13.
@article{180fc1aa11134cfd971b68fa786ec6e4,
title = "GeneScissors: A comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment",
author = "Zhaojun Zhang and Shunping Huang and Jack Wang and Xiang Zhang and {Pardo Manuel De Villena}, Fernando and Leonard Mcmillan and Wei Wang",
year = "2013",
month = "7",
day = "1",
doi = "10.1093/bioinformatics/btt216",
language = "English (US)",
volume = "29",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "13",

}

GeneScissors : A comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment. / Zhang, Zhaojun; Huang, Shunping; Wang, Jack; Zhang, Xiang; Pardo Manuel De Villena, Fernando; Mcmillan, Leonard; Wang, Wei.

In: Bioinformatics, Vol. 29, No. 13, 01.07.2013.

TY - JOUR

T1 - GeneScissors

T2 - A comprehensive approach to detecting and correcting spurious transcriptome inference owing to RNA-seq reads misalignment

AU - Zhang, Zhaojun

AU - Huang, Shunping

AU - Wang, Jack

AU - Zhang, Xiang

AU - Pardo Manuel De Villena, Fernando

AU - Mcmillan, Leonard

AU - Wang, Wei

PY - 2013/7/1

Y1 - 2013/7/1

UR - http://www.scopus.com/inward/record.url?scp=84879968568&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84879968568&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btt216

DO - 10.1093/bioinformatics/btt216

M3 - Article

VL - 29

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 13

ER -