DeepBound: Accurate identification of transcript boundaries via deep convolutional neural fields

Mingfu Shao, Jianzhu Ma, Sheng Wang

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Motivation: Reconstructing full-length expressed transcripts (also known as the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts, as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging because the signal related to boundaries is noisy and weak.

Results: We present DeepBound, an effective approach to identifying boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we take the novel step of incorporating the AUC (area under the curve) score into the objective function being optimized. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose using simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulated datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods.

Original language: English (US)
Pages (from-to): i267-i273
Journal: Bioinformatics
Volume: 33
Issue number: 14
DOI: 10.1093/bioinformatics/btx267
State: Published - Jul 15 2017

Fingerprint

RNA
Genes
Statistical Models
Alignment
Gene expression
Area Under Curve
Labels
Gene
Gene Expression
Training Samples
Graphical Models
Transition Probability
Probabilistic Model
Sequencing
Experimental Study
Objective function
Datasets
Curve
Model
Simulation

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

@article{d76328b0a84c47e1bf2bc4b800fcd36e,
title = "DeepBound: Accurate identification of transcript boundaries via deep convolutional neural fields",
abstract = "Motivation: Reconstructing full-length expressed transcripts (also known as the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts, as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging because the signal related to boundaries is noisy and weak. Results: We present DeepBound, an effective approach to identifying boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we take the novel step of incorporating the AUC (area under the curve) score into the objective function being optimized. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose using simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulated datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods.",
author = "Mingfu Shao and Jianzhu Ma and Sheng Wang",
year = "2017",
month = "7",
day = "15",
doi = "10.1093/bioinformatics/btx267",
language = "English (US)",
volume = "33",
pages = "i267--i273",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "14",

}

DeepBound: Accurate identification of transcript boundaries via deep convolutional neural fields. / Shao, Mingfu; Ma, Jianzhu; Wang, Sheng.

In: Bioinformatics, Vol. 33, No. 14, 15.07.2017, p. i267-i273.

TY - JOUR

T1 - DeepBound

T2 - Accurate identification of transcript boundaries via deep convolutional neural fields

AU - Shao, Mingfu

AU - Ma, Jianzhu

AU - Wang, Sheng

PY - 2017/7/15

Y1 - 2017/7/15

AB - Motivation: Reconstructing full-length expressed transcripts (also known as the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts, as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging because the signal related to boundaries is noisy and weak. Results: We present DeepBound, an effective approach to identifying boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we take the novel step of incorporating the AUC (area under the curve) score into the objective function being optimized. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose using simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulated datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods.

UR - http://www.scopus.com/inward/record.url?scp=85024485395&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85024485395&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btx267

DO - 10.1093/bioinformatics/btx267

M3 - Article

C2 - 28881999

AN - SCOPUS:85024485395

VL - 33

SP - i267-i273

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 14

ER -