Assignment of endogenous retrovirus integration sites using a mixture model

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequence technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Variability in the number of reads from an individual integration site makes it difficult to determine whether a site is present for low read counts. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer (Odocoileus hemionus), a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
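The assignment step described in the abstract can be illustrated with a minimal sketch. It assumes one negative binomial component for spurious (ERV absent) reads and one for genuine (ERV present) reads, then applies Bayes' rule to a read count. The mean/size parameterization, the mixing weight, and all numeric values below are hypothetical choices for illustration; they are not the fitted model or the estimation procedure from the paper.

```python
import math

def nb_log_pmf(k, mean, size):
    """Log-pmf of a negative binomial with the given mean and size (dispersion)."""
    p = size / (size + mean)  # success probability implied by mean and size
    return (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
            + size * math.log(p) + k * math.log1p(-p))

def presence_probability(k, weight_present, absent, present):
    """Posterior probability that a site with read count k is truly present.

    absent and present are (mean, size) pairs for the two components;
    weight_present is the prior mixing weight of the present component.
    """
    log_present = math.log(weight_present) + nb_log_pmf(k, *present)
    log_absent = math.log1p(-weight_present) + nb_log_pmf(k, *absent)
    m = max(log_present, log_absent)  # subtract the max for numerical stability
    num = math.exp(log_present - m)
    return num / (num + math.exp(log_absent - m))

# Hypothetical parameters, chosen only to make the example concrete.
ABSENT = (0.5, 1.0)    # low-mean component: stray reads at an absent site
PRESENT = (20.0, 2.0)  # high-mean, overdispersed component: a real integration

for count in (0, 1, 2, 5, 10, 50):
    prob = presence_probability(count, 0.3, ABSENT, PRESENT)
    print(f"reads={count:3d}  P(present | reads)={prob:.4f}")
```

Unlike a fixed threshold count, this yields a graded probability for low read counts, which is the regime the abstract identifies as difficult.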

Original language: English (US)
Pages (from-to): 751-770
Number of pages: 20
Journal: Annals of Applied Statistics
Volume: 11
Issue number: 2
DOI: 10.1214/16-AOAS1016
State: Published - Jun 2017

Fingerprint

Mixture Model
Count
Genome
Assignment
Genes
Negative binomial distribution
Component Model
Clustering algorithms
High Throughput
Genomics
Assign
Animals
Throughput
Resources
Alternatives
Demonstrate
Model

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Modeling and Simulation
  • Statistics, Probability and Uncertainty

Cite this

@article{872c56a48a084a5bb3b990967270bc15,
title = "Assignment of endogenous retrovirus integration sites using a mixture model",
author = "Hunter, David R. and Bao, Le and Poss, Mary",
year = "2017",
month = "6",
doi = "10.1214/16-AOAS1016",
language = "English (US)",
volume = "11",
pages = "751--770",
journal = "Annals of Applied Statistics",
issn = "1932-6157",
publisher = "Institute of Mathematical Statistics",
number = "2",

}

Assignment of endogenous retrovirus integration sites using a mixture model. / Hunter, David R.; Bao, Le; Poss, Mary.

In: Annals of Applied Statistics, Vol. 11, No. 2, 06.2017, p. 751-770.

TY  - JOUR
T1  - Assignment of endogenous retrovirus integration sites using a mixture model
AU  - Hunter, David R.
AU  - Bao, Le
AU  - Poss, Mary
PY  - 2017/6
Y1  - 2017/6
UR  - http://www.scopus.com/inward/record.url?scp=85026325078&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85026325078&partnerID=8YFLogxK
U2  - 10.1214/16-AOAS1016
DO  - 10.1214/16-AOAS1016
M3  - Article
AN  - SCOPUS:85026325078
VL  - 11
SP  - 751
EP  - 770
JO  - Annals of Applied Statistics
JF  - Annals of Applied Statistics
SN  - 1932-6157
IS  - 2
ER  -