Protein sequence classification using feature hashing

Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra

Research output: Contribution to journal › Article

21 Citations (Scopus)

Abstract

Recent advances in next-generation sequencing technologies have resulted in an exponential increase in the rate at which protein sequence data are being acquired. The k-gram feature representation, commonly used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. In this paper, we study the applicability of feature hashing to protein sequence classification. The original high-dimensional space is "reduced" by using a hash function to map features to hash keys in a low-dimensional space; multiple features can be mapped (at random) to the same hash key, and their counts are "aggregated". We compare feature hashing with the "bag of k-grams" approach. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
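The hashing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name a hash function, so the use of MD5 here is an assumption, as are the function names `kgrams`, `hash_bucket`, `hashed_features`, and `bag_of_kgrams`, the parameter `n_buckets`, and the example sequence.

```python
import hashlib
from collections import Counter


def kgrams(sequence, k):
    """Return all overlapping k-grams (substrings of length k) of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def hash_bucket(feature, n_buckets):
    """Map a feature string to a hash key in the range [0, n_buckets).

    MD5 is an assumption made for this sketch; it is used only because it
    is deterministic across runs, unlike Python's built-in hash() for str.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_features(sequence, k, n_buckets):
    """Build a fixed-length count vector by hashing each k-gram.

    K-grams that collide on the same hash key have their counts added
    together, which is the "aggregating" step from the abstract.
    """
    vector = [0] * n_buckets
    for gram in kgrams(sequence, k):
        vector[hash_bucket(gram, n_buckets)] += 1
    return vector


def bag_of_kgrams(sequence, k):
    """Baseline representation: one count per distinct k-gram, no hashing."""
    return Counter(kgrams(sequence, k))


if __name__ == "__main__":
    seq = "MKVLAAGIVKMKVL"  # hypothetical example sequence
    print(bag_of_kgrams(seq, 3))
    print(hashed_features(seq, 3, 8))
```

Assuming an alphabet of the 20 standard amino acids, the bag of k-grams has up to 20^k dimensions, whereas the hashed vector always has `n_buckets` dimensions regardless of k. The total count is preserved under hashing; only the identity of colliding k-grams is lost.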

Original language: English (US)
Article number: S14
Journal: Proteome Science
Volume: 10
DOI: 10.1186/1477-5956-10-s1-s14
State: Published - Jan 1 2012

Fingerprint

Proteins
Hash functions
Data Mining
Learning algorithms
Data mining
Learning
Technology

All Science Journal Classification (ASJC) codes

  • Biochemistry
  • Molecular Biology

Cite this

Caragea, Cornelia ; Silvescu, Adrian ; Mitra, Prasenjit. / Protein sequence classification using feature hashing. In: Proteome Science. 2012 ; Vol. 10.
@article{4844d119c0b44b08b8665b374ea40f1e,
title = "Protein sequence classification using feature hashing",
author = "Cornelia Caragea and Adrian Silvescu and Prasenjit Mitra",
year = "2012",
month = "1",
day = "1",
doi = "10.1186/1477-5956-10-s1-s14",
language = "English (US)",
volume = "10",
journal = "Proteome Science",
issn = "1477-5956",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Protein sequence classification using feature hashing

AU - Caragea, Cornelia

AU - Silvescu, Adrian

AU - Mitra, Prasenjit

PY - 2012/1/1

Y1 - 2012/1/1

UR - http://www.scopus.com/inward/record.url?scp=85011792680&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85011792680&partnerID=8YFLogxK

U2 - 10.1186/1477-5956-10-s1-s14

DO - 10.1186/1477-5956-10-s1-s14

M3 - Article

AN - SCOPUS:85011792680

VL - 10

JO - Proteome Science

JF - Proteome Science

SN - 1477-5956

M1 - S14

ER -