A generalized linear model for peak calling in ChIP-Seq data

Jialin Xu, Yu Zhang

Research output: Contribution to journal (Article)

4 Citations (Scopus)

Abstract

Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) has become a routine method for detecting genome-wide protein-DNA interactions. The success of ChIP-Seq data analysis depends heavily on the quality of peak calling (i.e., detecting peaks of tag counts at a genomic location and evaluating whether each peak corresponds to a real protein-DNA interaction event). The challenges in peak calling include (1) how to combine the forward and reverse strand tag data to improve the power of peak calling and (2) how to account for the variation of tag data observed across different genomic locations. We introduce a new peak calling method based on the generalized linear model (GLMNB) that uses the negative binomial distribution to model the tag count data and to account for the variation of background tags that may randomly bind to the DNA sequence at varying levels due to local genomic structure and sequence content. We allow local shifting of the peaks observed on the forward and reverse strands, such that at each potential binding site, a binding profile representing the pattern of a real peak signal is fitted by maximum likelihood to best explain the observed tag data. Our method can also detect multiple peaks within a local region if the region contains multiple binding sites.
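The core idea in the abstract, fitting a binding profile to negative binomial tag counts by maximum likelihood, can be illustrated with a minimal sketch. This is not the authors' GLMNB implementation: the triangular profile, the fixed dispersion `size`, the known `background` rate, the grid search, and all function names are assumptions made for illustration, and the forward/reverse strand shifting described in the abstract is omitted.

```python
import math


def nb_loglik(counts, means, size):
    """Negative binomial log-likelihood with per-position means.

    `size` is the dispersion parameter (variance = mu + mu**2 / size).
    """
    total = 0.0
    for y, mu in zip(counts, means):
        total += (
            math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu))
        )
    return total


def profile(pos, center, width):
    """Hypothetical triangular binding profile: 1 at the center, 0 beyond `width`."""
    return max(0.0, 1.0 - abs(pos - center) / width)


def fit_peak(counts, background, size=5.0, width=4):
    """Grid-search the peak center and height that maximize the likelihood.

    Log link (an assumption): log(mu_i) = log(background) + height * profile_i.
    Returns the best center, its height, and a likelihood-ratio statistic
    against the background-only model.
    """
    n = len(counts)
    null_ll = nb_loglik(counts, [background] * n, size)
    best_ll, best_center, best_height = null_ll, None, 0.0
    heights = [0.25 * k for k in range(1, 17)]  # candidate heights 0.25 .. 4.0
    for center in range(n):
        for height in heights:
            means = [
                background * math.exp(height * profile(i, center, width))
                for i in range(n)
            ]
            ll = nb_loglik(counts, means, size)
            if ll > best_ll:
                best_ll, best_center, best_height = ll, center, height
    return {
        "center": best_center,
        "height": best_height,
        "lr_stat": 2.0 * (best_ll - null_ll),
    }


if __name__ == "__main__":
    # Toy tag counts in consecutive bins with a bump around index 6.
    tags = [2, 2, 2, 2, 5, 10, 20, 10, 5, 2, 2, 2, 2]
    print(fit_peak(tags, background=2.0))
```

A grid search stands in here for a proper GLM fit so the sketch stays dependency-free. In practice the dispersion and background level would be estimated from the data rather than fixed.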

Original language: English (US)
Pages (from-to): 826-838
Number of pages: 13
Journal: Journal of Computational Biology
Volume: 19
Issue number: 6
DOI: 10.1089/cmb.2012.0023
State: Published - Jun 1 2012

Fingerprint

Generalized Linear Model
Binding sites
Linear Models
DNA
Chip
Binding Sites
Binomial Distribution
Proteins
High-Throughput Nucleotide Sequencing
Chromatin Immunoprecipitation
DNA sequences
Maximum likelihood
Genomics
Genes
Genome
Reverse
Protein
Negative binomial distribution
Count Data
Chromatin

All Science Journal Classification (ASJC) codes

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Cite this

Xu, Jialin ; Zhang, Yu. / A generalized linear model for peak calling in ChIP-Seq data. In: Journal of Computational Biology. 2012 ; Vol. 19, No. 6. pp. 826-838.
@article{185ee9279b6e4d81801951320837ea60,
title = "A generalized linear model for peak calling in ChIP-Seq data",
abstract = "Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) has become a routine method for detecting genome-wide protein-DNA interactions. The success of ChIP-Seq data analysis depends heavily on the quality of peak calling (i.e., detecting peaks of tag counts at a genomic location and evaluating whether each peak corresponds to a real protein-DNA interaction event). The challenges in peak calling include (1) how to combine the forward and reverse strand tag data to improve the power of peak calling and (2) how to account for the variation of tag data observed across different genomic locations. We introduce a new peak calling method based on the generalized linear model (GLMNB) that uses the negative binomial distribution to model the tag count data and to account for the variation of background tags that may randomly bind to the DNA sequence at varying levels due to local genomic structure and sequence content. We allow local shifting of the peaks observed on the forward and reverse strands, such that at each potential binding site, a binding profile representing the pattern of a real peak signal is fitted by maximum likelihood to best explain the observed tag data. Our method can also detect multiple peaks within a local region if the region contains multiple binding sites.",
author = "Jialin Xu and Yu Zhang",
year = "2012",
month = "6",
day = "1",
doi = "10.1089/cmb.2012.0023",
language = "English (US)",
volume = "19",
pages = "826--838",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "6",

}

TY - JOUR

T1 - A generalized linear model for peak calling in ChIP-Seq data

AU - Xu, Jialin

AU - Zhang, Yu

PY - 2012/6/1

Y1 - 2012/6/1

AB - Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) has become a routine method for detecting genome-wide protein-DNA interactions. The success of ChIP-Seq data analysis depends heavily on the quality of peak calling (i.e., detecting peaks of tag counts at a genomic location and evaluating whether each peak corresponds to a real protein-DNA interaction event). The challenges in peak calling include (1) how to combine the forward and reverse strand tag data to improve the power of peak calling and (2) how to account for the variation of tag data observed across different genomic locations. We introduce a new peak calling method based on the generalized linear model (GLMNB) that uses the negative binomial distribution to model the tag count data and to account for the variation of background tags that may randomly bind to the DNA sequence at varying levels due to local genomic structure and sequence content. We allow local shifting of the peaks observed on the forward and reverse strands, such that at each potential binding site, a binding profile representing the pattern of a real peak signal is fitted by maximum likelihood to best explain the observed tag data. Our method can also detect multiple peaks within a local region if the region contains multiple binding sites.

UR - http://www.scopus.com/inward/record.url?scp=84862570727&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862570727&partnerID=8YFLogxK

U2 - 10.1089/cmb.2012.0023

DO - 10.1089/cmb.2012.0023

M3 - Article

C2 - 22533622

AN - SCOPUS:84862570727

VL - 19

SP - 826

EP - 838

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 6

ER -