A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction

Digna R. Velez, Bill C. White, Alison A. Motsinger, William S. Bush, Marylyn Deriggi Ritchie, Scott M. Williams, Jason H. Moore

Research output: Contribution to journal (Article)

213 Citations (Scopus)

Abstract

Multifactor dimensionality reduction (MDR) was developed as a method for detecting statistical patterns of epistasis. The overall goal of MDR is to change the representation space of the data to make interactions easier to detect. It is well known that machine learning methods may not provide robust models when the class variable (e.g. case-control status) is imbalanced and accuracy is used as the fitness measure. This is because most methods learn patterns that are relevant for the larger of the two classes. The goal of this study was to evaluate three different strategies for improving the power of MDR to detect epistasis in imbalanced datasets. The methods evaluated were: (1) over-sampling that resamples with replacement the smaller class until the data are balanced, (2) under-sampling that randomly removes subjects from the larger class until the data are balanced, and (3) balanced accuracy [(sensitivity+specificity)/2] as the fitness function with and without an adjusted threshold. These three methods were compared using simulated data with two-locus epistatic interactions of varying heritability (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and minor allele frequency (0.2, 0.4) that were embedded in 100 replicate datasets of varying sample sizes (400, 800, 1600). Each dataset was generated with different ratios of cases to controls (1:1, 1:2, 1:4). We found that the balanced accuracy function with an adjusted threshold significantly outperformed both over-sampling and under-sampling and fully recovered the power. These results suggest that balanced accuracy should be used instead of accuracy for the MDR analysis of epistasis in imbalanced datasets.
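As a minimal sketch of why the choice of fitness measure matters, the Python below compares plain accuracy with the balanced accuracy defined in the abstract, (sensitivity + specificity)/2, on a 1:4 case-control dataset. The function names are illustrative and are not taken from any MDR software. The abstract does not define the adjusted threshold; the sketch assumes it is the overall case:control ratio of the dataset, used in place of a default ratio of 1.0 when labeling a genotype cell as high risk.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = case and 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of subjects classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """(sensitivity + specificity) / 2, as given in the abstract."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)  # fraction of cases correctly called cases
    specificity = tn / (tn + fp)  # fraction of controls correctly called controls
    return (sensitivity + specificity) / 2


def is_high_risk(cases_in_cell, controls_in_cell, threshold=1.0):
    """Label a genotype cell high risk when its case:control ratio exceeds
    the threshold. Written as a product to avoid dividing by zero controls.
    ASSUMPTION: the 'adjusted threshold' is the dataset-wide case:control ratio.
    """
    return cases_in_cell > threshold * controls_in_cell


# A 1:4 dataset: 100 cases and 400 controls.
y_true = [1] * 100 + [0] * 400

# A degenerate model that calls every subject a control.
y_all_controls = [0] * 500

acc = accuracy(y_true, y_all_controls)           # rewards the majority class
bal = balanced_accuracy(y_true, y_all_controls)  # no better than chance

# Hypothetical genotype cell holding 30 cases and 60 controls.
adjusted_threshold = 100 / 400
default_label = is_high_risk(30, 60)                       # threshold 1.0
adjusted_label = is_high_risk(30, 60, adjusted_threshold)  # threshold 0.25

print(acc, bal, default_label, adjusted_label)
```

The all-controls model looks strong under accuracy but scores at chance level under balanced accuracy, which is the failure mode the abstract describes. Under the assumed adjusted threshold, the example cell is labeled high risk because cases are over-represented relative to the dataset as a whole, even though controls still outnumber cases within the cell.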

Original language: English (US)
Pages (from-to): 306-315
Number of pages: 10
Journal: Genetic Epidemiology
Volume: 31
Issue number: 4
DOI: https://doi.org/10.1002/gepi.20211
State: Published - May 1 2007

Fingerprint

Multifactor Dimensionality Reduction
Gene Frequency
Sample Size
Datasets
Sensitivity and Specificity

All Science Journal Classification (ASJC) codes

  • Epidemiology
  • Genetics (clinical)

Cite this

Velez, D. R., White, B. C., Motsinger, A. A., Bush, W. S., Ritchie, M. D., Williams, S. M., & Moore, J. H. (2007). A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genetic Epidemiology, 31(4), 306-315. https://doi.org/10.1002/gepi.20211
Velez, Digna R. ; White, Bill C. ; Motsinger, Alison A. ; Bush, William S. ; Ritchie, Marylyn Deriggi ; Williams, Scott M. ; Moore, Jason H. / A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. In: Genetic Epidemiology. 2007 ; Vol. 31, No. 4. pp. 306-315.
@article{bae24028279c43cb8fac3c2c1a161145,
title = "A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction",
author = "Velez, {Digna R.} and White, {Bill C.} and Motsinger, {Alison A.} and Bush, {William S.} and Ritchie, {Marylyn Deriggi} and Williams, {Scott M.} and Moore, {Jason H.}",
year = "2007",
month = "5",
day = "1",
doi = "10.1002/gepi.20211",
language = "English (US)",
volume = "31",
pages = "306--315",
journal = "Genetic Epidemiology",
issn = "0741-0395",
publisher = "Wiley-Liss Inc.",
number = "4",

}

Velez, DR, White, BC, Motsinger, AA, Bush, WS, Ritchie, MD, Williams, SM & Moore, JH 2007, 'A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction', Genetic Epidemiology, vol. 31, no. 4, pp. 306-315. https://doi.org/10.1002/gepi.20211

A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. / Velez, Digna R.; White, Bill C.; Motsinger, Alison A.; Bush, William S.; Ritchie, Marylyn Deriggi; Williams, Scott M.; Moore, Jason H.

In: Genetic Epidemiology, Vol. 31, No. 4, 01.05.2007, p. 306-315.

TY - JOUR

T1 - A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction

AU - Velez, Digna R.

AU - White, Bill C.

AU - Motsinger, Alison A.

AU - Bush, William S.

AU - Ritchie, Marylyn Deriggi

AU - Williams, Scott M.

AU - Moore, Jason H.

PY - 2007/5/1

Y1 - 2007/5/1

UR - http://www.scopus.com/inward/record.url?scp=34247338434&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34247338434&partnerID=8YFLogxK

U2 - 10.1002/gepi.20211

DO - 10.1002/gepi.20211

M3 - Article

C2 - 17323372

AN - SCOPUS:34247338434

VL - 31

SP - 306

EP - 315

JO - Genetic Epidemiology

JF - Genetic Epidemiology

SN - 0741-0395

IS - 4

ER -