EST clustering error evaluation and correction

Ji Ping Z. Wang, Bruce G. Lindsay, James Leebens-Mack, Liying Cui, Kerr Wall, Webb C. Miller, Claude Walker Depamphilis

Research output: Contribution to journal › Article

58 Citations (Scopus)

Abstract

Motivation: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often greatly inflate estimates of the number of unique genes obtained, have become a major obstacle in such analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated.

Results: We identify and quantify two types of EST clustering error, Type I and Type II, in EST clustering with the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5′ and 3′ EST clustering, the Type I error rate in the 5′ EST case is ∼10 times higher than in the 3′ EST case (30% versus 3%). An over-stringent identity rule, e.g. P ≥ 95%, may even inflate the Type I error in both cases. We demonstrate that ∼80% of the Type I error in 5′ EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and so provide more accurate estimates of the true gene cluster profile.

Original language: English (US)
Pages (from-to): 2973-2984
Number of pages: 12
Journal: Bioinformatics
Volume: 20
Issue number: 17
DOIs: https://doi.org/10.1093/bioinformatics/bth342
State: Published - Nov 22 2004

Fingerprint

Expressed Sequence Tags
Cluster Analysis
Clustering
Type I error
Evaluation
Type II error
Gene
Genes
Error Correction
cDNA
Estimate
Gene Expression
Error Rate
Overlap
Quantify
Distinct
Multigene Family
Gene Library

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Wang, J. P. Z., Lindsay, B. G., Leebens-Mack, J., Cui, L., Wall, K., Miller, W. C., & Depamphilis, C. W. (2004). EST clustering error evaluation and correction. Bioinformatics, 20(17), 2973-2984. https://doi.org/10.1093/bioinformatics/bth342
@article{4f0e4ee5fde94e37b0b73c2180e7c40b,
title = "EST clustering error evaluation and correction",
author = "Wang, {Ji Ping Z.} and Lindsay, {Bruce G.} and James Leebens-Mack and Liying Cui and Kerr Wall and Miller, {Webb C.} and Depamphilis, {Claude Walker}",
year = "2004",
month = "11",
day = "22",
doi = "10.1093/bioinformatics/bth342",
language = "English (US)",
volume = "20",
pages = "2973--2984",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "17",

}
