Combining hashing and abstraction in sparse high dimensional feature spaces

Cornelia Caragea, Adrian Silvescu, Prasenjit Mitra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

With the exponential increase in the number of documents available online, such as news articles, weblogs, and scientific documents, effective and efficient classification methods are needed. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used "bag of words" and n-gram representations can result in prohibitively high dimensional input spaces. Data mining algorithms applied to these input spaces may be intractable due to the large number of dimensions. Thus, dimensionality reduction algorithms that can process data into features fast at runtime, ideally in constant time per feature, are greatly needed in high throughput applications, where the number of features and data points can be on the order of millions. One promising line of research on dimensionality reduction is feature clustering. We propose to combine two types of feature clustering, namely hashing and abstraction based on hierarchical agglomerative clustering, in order to take advantage of the strengths of both techniques. Experimental results on two text data sets show that the combined approach uses a significantly smaller number of features and gives similar performance when compared with the "bag of words" and n-gram approaches.
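The two-stage idea in the abstract (hash features into a fixed number of buckets, then group the buckets by hierarchical agglomerative clustering) can be illustrated with a minimal sketch. The abstract does not specify the hash function or the clustering criterion, so everything below is an assumption made for illustration: MD5 as the hash, a weighted Jensen-Shannon divergence between smoothed class distributions as the merge cost, and all function and parameter names (`hash_feature`, `abstract_buckets`, `num_buckets`, `num_abstractions`). It is not the authors' implementation.

```python
import hashlib
import math
from collections import Counter
from itertools import combinations


def hash_feature(token, num_buckets):
    """Map a token to a bucket index in constant time (feature hashing).
    MD5 is an assumed choice; any stable hash would serve."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def hashed_counts(tokens, num_buckets):
    """Bag-of-words counts over hash buckets instead of raw tokens."""
    return Counter(hash_feature(t, num_buckets) for t in tokens)


def class_distribution(counts, classes, alpha=1.0):
    """Laplace-smoothed class distribution for one bucket or cluster."""
    total = sum(counts.get(c, 0) for c in classes) + alpha * len(classes)
    return [(counts.get(c, 0) + alpha) / total for c in classes]


def kl(p, q):
    """Kullback-Leibler divergence; inputs are smoothed, so q > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def weighted_js(p, q, wp, wq):
    """Weighted Jensen-Shannon divergence (assumed merge criterion)."""
    lam = wp / (wp + wq)
    m = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    return lam * kl(p, m) + (1 - lam) * kl(q, m)


def abstract_buckets(docs, labels, num_buckets, num_abstractions):
    """Greedy agglomerative clustering of hash buckets into abstractions.
    Returns a dict: bucket index -> abstraction index."""
    classes = sorted(set(labels))
    # Per-bucket class counts gathered from the labeled training documents.
    stats = {}
    for tokens, label in zip(docs, labels):
        for bucket, n in hashed_counts(tokens, num_buckets).items():
            stats.setdefault(bucket, Counter())[label] += n
    # Each cluster starts as a single bucket: id -> (members, class counts).
    clusters = {b: ([b], c) for b, c in stats.items()}

    def cost(pair):
        (_, ca), (_, cb) = clusters[pair[0]], clusters[pair[1]]
        wa, wb = sum(ca.values()), sum(cb.values())
        return (wa + wb) * weighted_js(
            class_distribution(ca, classes),
            class_distribution(cb, classes), wa, wb)

    # Repeatedly merge the pair whose class distributions are most alike.
    while len(clusters) > num_abstractions:
        a, b = min(combinations(sorted(clusters), 2), key=cost)
        members_b, counts_b = clusters.pop(b)
        clusters[a] = (clusters[a][0] + members_b, clusters[a][1] + counts_b)
    return {bucket: i
            for i, (members, _) in enumerate(clusters.values())
            for bucket in members}
```

At prediction time, a token `t` would be mapped to a feature with `mapping.get(hash_feature(t, num_buckets))`, which keeps the per-feature cost constant; buckets never seen in training have no abstraction in this sketch and return `None`. The pairwise search is quadratic in the number of buckets per merge, which is acceptable only because hashing has already reduced the feature count.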

Original language: English (US)
Title of host publication: AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference
Pages: 3-9
Number of pages: 7
Volume: 1
State: Published - Nov 7 2012
Event: 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12 - Toronto, ON, Canada
Duration: Jul 22 2012 → Jul 26 2012

Other

Other: 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12
Country: Canada
City: Toronto, ON
Period: 7/22/12 → 7/26/12

Fingerprint

Data mining
Classifiers
Throughput

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

Caragea, C., Silvescu, A., & Mitra, P. (2012). Combining hashing and abstraction in sparse high dimensional feature spaces. In AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference (Vol. 1, pp. 3-9)
Caragea, Cornelia ; Silvescu, Adrian ; Mitra, Prasenjit. / Combining hashing and abstraction in sparse high dimensional feature spaces. AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference. Vol. 1 2012. pp. 3-9
@inproceedings{0dc8b7fa7ba64f9784a5e34faf07b83e,
title = "Combining hashing and abstraction in sparse high dimensional feature spaces",
abstract = "With the exponential increase in the number of documents available online, such as news articles, weblogs, and scientific documents, effective and efficient classification methods are needed. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used {"}bag of words{"} and n-gram representations can result in prohibitively high dimensional input spaces. Data mining algorithms applied to these input spaces may be intractable due to the large number of dimensions. Thus, dimensionality reduction algorithms that can process data into features fast at runtime, ideally in constant time per feature, are greatly needed in high throughput applications, where the number of features and data points can be on the order of millions. One promising line of research on dimensionality reduction is feature clustering. We propose to combine two types of feature clustering, namely hashing and abstraction based on hierarchical agglomerative clustering, in order to take advantage of the strengths of both techniques. Experimental results on two text data sets show that the combined approach uses a significantly smaller number of features and gives similar performance when compared with the {"}bag of words{"} and n-gram approaches.",
author = "Cornelia Caragea and Adrian Silvescu and Prasenjit Mitra",
year = "2012",
month = "11",
day = "7",
language = "English (US)",
isbn = "9781577355687",
volume = "1",
pages = "3--9",
booktitle = "AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference",

}

Caragea, C, Silvescu, A & Mitra, P 2012, Combining hashing and abstraction in sparse high dimensional feature spaces. in AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference. vol. 1, pp. 3-9, 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12, Toronto, ON, Canada, 7/22/12.

TY - GEN

T1 - Combining hashing and abstraction in sparse high dimensional feature spaces

AU - Caragea, Cornelia

AU - Silvescu, Adrian

AU - Mitra, Prasenjit

PY - 2012/11/7

Y1 - 2012/11/7

N2 - With the exponential increase in the number of documents available online, such as news articles, weblogs, and scientific documents, effective and efficient classification methods are needed. The performance of document classifiers critically depends, among other things, on the choice of the feature representation. The commonly used "bag of words" and n-gram representations can result in prohibitively high dimensional input spaces. Data mining algorithms applied to these input spaces may be intractable due to the large number of dimensions. Thus, dimensionality reduction algorithms that can process data into features fast at runtime, ideally in constant time per feature, are greatly needed in high throughput applications, where the number of features and data points can be on the order of millions. One promising line of research on dimensionality reduction is feature clustering. We propose to combine two types of feature clustering, namely hashing and abstraction based on hierarchical agglomerative clustering, in order to take advantage of the strengths of both techniques. Experimental results on two text data sets show that the combined approach uses a significantly smaller number of features and gives similar performance when compared with the "bag of words" and n-gram approaches.

UR - http://www.scopus.com/inward/record.url?scp=84868279238&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84868279238&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781577355687

VL - 1

SP - 3

EP - 9

BT - AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference

ER -

Caragea C, Silvescu A, Mitra P. Combining hashing and abstraction in sparse high dimensional feature spaces. In AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference. Vol. 1. 2012. p. 3-9