CRD: Fast co-clustering on large datasets utilizing sampling-based matrix decomposition

Feng Pan, Xiang Zhang, Wei Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Citations (Scopus)

Abstract

The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be entirely held in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.
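
To illustrate the general idea of a sampling-based matrix decomposition mentioned in the abstract, the following is a minimal sketch of a CUR-style approximation in Python with NumPy. It is not the CRD algorithm itself, whose details are not given in this record. The function name `cur_sketch`, the norm-proportional sampling rule, and the `rcond` cutoff are all assumptions made for illustration. The point it shows is that a small set of sampled columns C and rows R, linked by a small matrix U, can stand in for the full m × n matrix, so later clustering steps could work on C and R rather than on the whole matrix.

```python
import numpy as np


def cur_sketch(A, c, r, rng):
    """Illustrative CUR-style decomposition (hypothetical helper, not CRD).

    Samples c columns and r rows of A without replacement, with probability
    proportional to their squared Euclidean norms, and returns (C, U, R)
    such that C @ U @ R approximates A.
    """
    sq = A ** 2
    col_p = sq.sum(axis=0) / sq.sum()  # column sampling probabilities
    row_p = sq.sum(axis=1) / sq.sum()  # row sampling probabilities

    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)

    C = A[:, cols]              # m x c block of sampled columns
    R = A[rows, :]              # r x n block of sampled rows
    W = A[np.ix_(rows, cols)]   # r x c intersection of sampled rows/columns

    # U is the pseudo-inverse of the intersection block (c x r).
    # The rcond cutoff is an assumed value that discards near-zero
    # singular values of W.
    U = np.linalg.pinv(W, rcond=1e-10)
    return C, U, R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic low-rank matrix: 50 rows, 40 columns, rank 3.
    A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
    C, U, R = cur_sketch(A, c=6, r=6, rng=rng)
    rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
    print("relative reconstruction error:", rel_err)
```

In this sketch the sampling probabilities are computed from the full matrix in one pass for simplicity; a method aimed at data that does not fit in main memory would need to obtain them in a streaming fashion or use a different sampling rule.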

Original language: English (US)
Title of host publication: SIGMOD 2008
Subtitle of host publication: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008
Pages: 173-184
Number of pages: 12
DOI: https://doi.org/10.1145/1376616.1376637
State: Published - Dec 10 2008
Event: 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08 - Vancouver, BC, Canada
Duration: Jun 9 2008 → Jun 12 2008

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Other

Other: 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08
Country: Canada
City: Vancouver, BC
Period: 6/9/08 → 6/12/08

Fingerprint

Clustering algorithms
Sampling
Decomposition
Data storage equipment
Recommender systems
Data mining
Systems analysis
Costs
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Cite this

Pan, F., Zhang, X., & Wang, W. (2008). CRD: Fast co-clustering on large datasets utilizing sampling-based matrix decomposition. In SIGMOD 2008: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008 (pp. 173-184). [1376637] (Proceedings of the ACM SIGMOD International Conference on Management of Data). https://doi.org/10.1145/1376616.1376637
Pan, Feng ; Zhang, Xiang ; Wang, Wei. / CRD : Fast co-clustering on large datasets utilizing sampling-based matrix decomposition. SIGMOD 2008: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008. 2008. pp. 173-184 (Proceedings of the ACM SIGMOD International Conference on Management of Data).
@inproceedings{61ba62c2294b4db9a2454ad7ca08b8df,
title = "CRD: Fast co-clustering on large datasets utilizing sampling-based matrix decomposition",
abstract = "The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be entirely held in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.",
author = "Feng Pan and Xiang Zhang and Wei Wang",
year = "2008",
month = "12",
day = "10",
doi = "10.1145/1376616.1376637",
language = "English (US)",
isbn = "9781605581026",
series = "Proceedings of the ACM SIGMOD International Conference on Management of Data",
pages = "173--184",
booktitle = "SIGMOD 2008",

}

Pan, F, Zhang, X & Wang, W 2008, CRD: Fast co-clustering on large datasets utilizing sampling-based matrix decomposition. in SIGMOD 2008: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008., 1376637, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 173-184, 2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08, Vancouver, BC, Canada, 6/9/08. https://doi.org/10.1145/1376616.1376637

CRD : Fast co-clustering on large datasets utilizing sampling-based matrix decomposition. / Pan, Feng; Zhang, Xiang; Wang, Wei.

SIGMOD 2008: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008. 2008. p. 173-184 1376637 (Proceedings of the ACM SIGMOD International Conference on Management of Data).


TY - GEN

T1 - CRD

T2 - Fast co-clustering on large datasets utilizing sampling-based matrix decomposition

AU - Pan, Feng

AU - Zhang, Xiang

AU - Wang, Wei

PY - 2008/12/10

Y1 - 2008/12/10

N2 - The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be entirely held in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.

AB - The problem of simultaneously clustering columns and rows (co-clustering) arises in important applications, such as text data mining, microarray analysis, and recommendation system analysis. Compared with the classical clustering algorithms, co-clustering algorithms have been shown to be more effective in discovering hidden clustering structures in the data matrix. The complexity of previous co-clustering algorithms is usually O(m × n), where m and n are the numbers of rows and columns in the data matrix, respectively. This limits their applicability to data matrices involving a large number of columns and rows. Moreover, some huge datasets cannot be entirely held in main memory during co-clustering, which violates the assumption made by the previous algorithms. In this paper, we propose a general framework for fast co-clustering large datasets, CRD. By utilizing recently developed sampling-based matrix decomposition methods, CRD achieves an execution time linear in m and n. Also, CRD does not require the whole data matrix to be in main memory. We conducted extensive experiments on both real and synthetic data. Compared with previous co-clustering algorithms, CRD achieves competitive accuracy but with much less computational cost.

UR - http://www.scopus.com/inward/record.url?scp=57149147732&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=57149147732&partnerID=8YFLogxK

U2 - 10.1145/1376616.1376637

DO - 10.1145/1376616.1376637

M3 - Conference contribution

AN - SCOPUS:57149147732

SN - 9781605581026

T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data

SP - 173

EP - 184

BT - SIGMOD 2008

ER -

Pan F, Zhang X, Wang W. CRD: Fast co-clustering on large datasets utilizing sampling-based matrix decomposition. In SIGMOD 2008: Proceedings of the ACM SIGMOD International Conference on Management of Data 2008. 2008. p. 173-184. 1376637. (Proceedings of the ACM SIGMOD International Conference on Management of Data). https://doi.org/10.1145/1376616.1376637