Cluster vector space model

A dimensionality reduction method for text classifications based on the vector quantization

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The Word Vector Space Model (WVSM), widely used in text analytics, provides an elegant way to enable computers to understand natural languages by converting words into numbers. Having shown impressive results on syntactic and semantic tasks in natural language processing (NLP), WVSM is used in many search engines and information retrieval systems, as well as in text classification. However, because the basic elements of its feature space are words, the model has high dimensionality and is at risk of overfitting. An advanced prediction system with multiple models can also easily incur long training times under WVSM. In this paper, a Cluster Vector Space Model (CVSM) based on vector quantization is used for dimensionality reduction. The method maps a given word vector space into a much smaller cluster vector space. The results indicate that the CVSM, with less than 1% of the original feature size, works at least as well as the WVSM on a binary classification problem; on multi-class classification problems, again with less than 1% of the original feature size, CVSM improves the performance of a decision tree model.
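
The abstract does not spell out the quantizer or how documents are projected into the cluster space, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes plain k-means as the vector quantizer and assumes that a document's cluster features are the summed counts of the words assigned to each cluster. The function names (`kmeans`, `to_cluster_space`), the toy vocabulary, and the 2-D word vectors are all hypothetical.

```python
import numpy as np

def kmeans(vectors, k, n_iter=20):
    """Plain Lloyd's k-means used as a vector quantizer.

    Returns the codebook (k centroids) and the cluster label of each vector.
    Initialisation with the first k vectors is an illustrative, deterministic
    choice, not something taken from the paper.
    """
    centroids = vectors[:k].copy()
    for _ in range(n_iter):
        # Pairwise Euclidean distances, shape (n_words, k).
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def to_cluster_space(doc_word_counts, labels, k):
    """Map (n_docs, n_words) counts to (n_docs, k) by summing counts per cluster."""
    n_words = labels.shape[0]
    assignment = np.zeros((n_words, k))
    assignment[np.arange(n_words), labels] = 1.0  # one-hot word-to-cluster matrix
    return doc_word_counts @ assignment

# Hypothetical vocabulary with made-up 2-D word vectors (two obvious groups).
vocab = ["good", "bad", "great", "awful", "fine", "poor"]
word_vectors = np.array([
    [1.0, 0.9], [-1.0, -0.9],
    [0.9, 1.0], [-0.9, -1.0],
    [1.1, 1.0], [-1.1, -1.0],
])

# Word-level features: one column per vocabulary word.
doc_word_counts = np.array([
    [2, 0, 1, 0, 1, 0],
    [0, 1, 0, 3, 0, 1],
    [1, 1, 0, 0, 0, 0],
], dtype=float)

k = 2
codebook, labels = kmeans(word_vectors, k)
doc_cluster_counts = to_cluster_space(doc_word_counts, labels, k)

print(labels)
print(doc_cluster_counts)
```

Here six word features collapse into two cluster features. With a realistic vocabulary the same projection would shrink tens of thousands of columns to a few hundred, which is the kind of reduction the abstract describes; the resulting matrix can then be passed to any classifier in place of the word-level one.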

Original language: English (US)
Title of host publication: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017
Publisher: Institute of Industrial Engineers
Pages: 428-433
Number of pages: 6
ISBN (Electronic): 9780983762461
State: Published - 2017
Event: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017 - Pittsburgh, United States
Duration: May 20 2017 to May 23 2017

Other

Other: 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017
Country: United States
City: Pittsburgh
Period: 5/20/17 to 5/23/17

Fingerprint

Vector quantization
Vector spaces
Word processing
Information retrieval systems
Syntactics
Search engines
Decision trees
Semantics

All Science Journal Classification (ASJC) codes

  • Industrial and Manufacturing Engineering

Cite this

Julaiti, J., & Tirupatikumara, S. R. (2017). Cluster vector space model: A dimensionality reduction method for text classifications based on the vector quantization. In 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017 (pp. 428-433). Institute of Industrial Engineers.
@inproceedings{c9779d95369c402c8ee4a30e55460d21,
title = "Cluster vector space model: A dimensionality reduction method for text classifications based on the vector quantization",
author = "Juxihong Julaiti and Tirupatikumara, {Soundar Rajan}",
year = "2017",
language = "English (US)",
pages = "428--433",
booktitle = "67th Annual Conference and Expo of the Institute of Industrial Engineers 2017",
publisher = "Institute of Industrial Engineers",
address = "United States",

}

TY - GEN

T1 - Cluster vector space model

T2 - A dimensionality reduction method for text classifications based on the vector quantization

AU - Julaiti, Juxihong

AU - Tirupatikumara, Soundar Rajan

PY - 2017

Y1 - 2017

UR - http://www.scopus.com/inward/record.url?scp=85030989592&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85030989592&partnerID=8YFLogxK

M3 - Conference contribution

SP - 428

EP - 433

BT - 67th Annual Conference and Expo of the Institute of Industrial Engineers 2017

PB - Institute of Industrial Engineers

ER -
