Abstract
Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequence technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Variability in the number of reads from an individual integration site makes it difficult to determine whether a site is present for low read counts. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer (Odocoileus hemionus), a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
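As a rough illustration of the idea in the abstract, the sketch below computes the posterior probability that a read count came from the "present" component of a two-component negative binomial mixture. It is a generic textbook-style calculation, not the model from the paper: the function names, the mean/size parameterization, and every numeric parameter value are assumptions chosen only for illustration, and in practice the parameters would be estimated from the data rather than fixed by hand.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log pmf of a negative binomial with the given mean and dispersion 'size'."""
    p = size / (size + mean)  # success probability implied by mean and size
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )


def prob_present(k, weight_present, absent, present):
    """Posterior probability that count k belongs to the 'present' component.

    'absent' and 'present' are (mean, size) tuples; 'weight_present' is the
    prior mixing weight of the 'present' component. All are hypothetical inputs.
    """
    log_a = math.log1p(-weight_present) + nb_log_pmf(k, *absent)
    log_p = math.log(weight_present) + nb_log_pmf(k, *present)
    m = max(log_a, log_p)  # subtract the max for numerical stability
    num = math.exp(log_p - m)
    return num / (math.exp(log_a - m) + num)


if __name__ == "__main__":
    # Illustrative parameters only: a low-mean "absent" (background) component
    # and a high-mean "present" component.
    absent = (0.5, 1.0)
    present = (40.0, 2.0)
    for count in (0, 1, 3, 10, 100):
        print(count, round(prob_present(count, 0.3, absent, present), 4))
```

Unlike a fixed threshold count, which only returns present or absent, this kind of calculation gives each count a graded probability, so low counts can be treated as uncertain instead of being forced into one class.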
Original language | English (US) |
---|---|
Pages (from-to) | 751-770 |
Number of pages | 20 |
Journal | Annals of Applied Statistics |
Volume | 11 |
Issue number | 2 |
DOIs | 10.1214/16-AOAS1016 |
State | Published - Jun 2017 |
All Science Journal Classification (ASJC) codes
- Statistics and Probability
- Modeling and Simulation
- Statistics, Probability and Uncertainty
Assignment of endogenous retrovirus integration sites using a mixture model. / Hunter, David R.; Bao, Le; Poss, Mary.
In: Annals of Applied Statistics, Vol. 11, No. 2, 06.2017, p. 751-770.
Research output: Contribution to journal › Article
TY - JOUR
T1 - Assignment of endogenous retrovirus integration sites using a mixture model
AU - Hunter, David R.
AU - Bao, Le
AU - Poss, Mary
PY - 2017/6
Y1 - 2017/6
UR - http://www.scopus.com/inward/record.url?scp=85026325078&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85026325078&partnerID=8YFLogxK
U2 - 10.1214/16-AOAS1016
DO - 10.1214/16-AOAS1016
M3 - Article
AN - SCOPUS:85026325078
VL - 11
SP - 751
EP - 770
JO - Annals of Applied Statistics
JF - Annals of Applied Statistics
SN - 1932-6157
IS - 2
ER -