Kernel based methods for accelerated failure time model with ultra-high dimensional data

Zhenqiu Liu, Dechang Chen, Ming Tan, Feng Jiang, Ronald B. Gartenhaus

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Background: Most genomic data have ultra-high dimensions, with more than 10,000 genes (probes). Regularization methods with L1 and Lp penalties have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n ≪ m (the number of genes), directly identifying a small subset of genes from ultra-high dimensional (m > 10,000) data is time-consuming and computationally inefficient. In current microarray analysis, the common practice is to select a few thousand (or a few hundred) genes using univariate analysis or statistical tests, and then apply a LASSO-type penalty to further reduce the number of disease-associated genes. This two-step procedure may introduce bias and inaccuracy and lead us to miss biologically important genes.

Results: The accelerated failure time (AFT) model is a linear regression model and a useful alternative to the Cox model for survival analysis. In this paper, we propose a nonlinear kernel-based AFT model and an efficient variable selection method with adaptive kernel ridge regression. Our proposed variable selection method is based on the kernel matrix and the dual problem, which involves a much smaller n × n matrix. It is very efficient when the number of unknown variables (genes) is much larger than the number of samples. Moreover, the primal variables are explicitly updated and the sparsity in the solution is exploited.

Conclusions: Our proposed methods can simultaneously identify survival-associated prognostic factors and predict survival outcomes with ultra-high dimensional genomic data. We have demonstrated the performance of our methods with both simulated and real data. In these limited computational studies, the proposed method performed very well.
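To illustrate why working in the dual keeps the computation at n × n even when m exceeds 10,000, here is a minimal sketch of plain kernel ridge regression on log survival times. It is not the authors' adaptive variable selection algorithm: it assumes every survival time is observed (no censoring), uses simulated data, and all names (`fit_kernel_ridge`, `rbf_kernel`, `lam`, `gamma`) are hypothetical choices made for illustration. With a linear kernel, the dual solution α of (K + λI)α = y gives back the primal ridge weights as w = Xᵀα.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kernel_ridge(K, y, lam):
    """Solve the dual system (K + lam * I) alpha = y, where K is n x n."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Simulated data (assumption): n = 40 samples, m = 10,000 genes,
# and only the first 5 genes influence the log survival time.
rng = np.random.default_rng(0)
n, m = 40, 10_000
X = rng.standard_normal((n, m))
beta = np.zeros(m)
beta[:5] = 1.0
log_t = X @ beta + 0.1 * rng.standard_normal(n)  # AFT form: log T = x'beta + noise

lam = 1.0

# Linear kernel: the only linear system solved is n x n, never m x m.
K_lin = X @ X.T
alpha_lin = fit_kernel_ridge(K_lin, log_t, lam)
w = X.T @ alpha_lin  # primal weights recovered from the dual variables

# Nonlinear (RBF) kernel: same n x n solve, different kernel matrix.
gamma = 1.0 / m
K_rbf = rbf_kernel(X, X, gamma)
alpha_rbf = fit_kernel_ridge(K_rbf, log_t, lam)
fitted_log_t = K_rbf @ alpha_rbf  # fitted log survival times on the training set
```

The recovered `w` satisfies the primal ridge normal equations, which can be checked without ever forming the m × m matrix XᵀX. Handling censored observations and selecting genes would need additional steps that this sketch does not attempt.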

Original language: English (US)
Article number: 606
Journal: BMC Bioinformatics
Volume: 11
DOI: 10.1186/1471-2105-11-606
State: Published - Dec 21 2010

Fingerprint

Accelerated Failure Time Model
High-dimensional Data
Genes
kernel
Gene
Genomics
Survival Analysis
Variable Selection
Penalty
Linear Models
High-dimensional
Prognostic Factors
Microarray Analysis
Kernel Regression
Cox Model
Ridge Regression
Statistical tests
Dual Problem
Regularization Method
Microarrays

All Science Journal Classification (ASJC) codes

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Liu, Zhenqiu ; Chen, Dechang ; Tan, Ming ; Jiang, Feng ; Gartenhaus, Ronald B. / Kernel based methods for accelerated failure time model with ultra-high dimensional data. In: BMC bioinformatics. 2010 ; Vol. 11.
@article{b587b8c0a4ad401d9c14b48a35f2a407,
title = "Kernel based methods for accelerated failure time model with ultra-high dimensional data",
author = "Zhenqiu Liu and Dechang Chen and Ming Tan and Feng Jiang and Gartenhaus, {Ronald B.}",
year = "2010",
month = "12",
day = "21",
doi = "10.1186/1471-2105-11-606",
language = "English (US)",
volume = "11",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Kernel based methods for accelerated failure time model with ultra-high dimensional data

AU - Liu, Zhenqiu

AU - Chen, Dechang

AU - Tan, Ming

AU - Jiang, Feng

AU - Gartenhaus, Ronald B.

PY - 2010/12/21

Y1 - 2010/12/21

UR - http://www.scopus.com/inward/record.url?scp=78650279819&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650279819&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-11-606

DO - 10.1186/1471-2105-11-606

M3 - Article

C2 - 21176134

AN - SCOPUS:78650279819

VL - 11

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 606

ER -