A general framework for fast co-clustering on large datasets using matrix decomposition

Feng Pan, Xiang Zhang, Wei Wang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

Simultaneously clustering the columns and rows of a large data matrix (co-clustering) is an important problem with wide applications, such as document mining, microarray analysis, and recommendation systems. Several co-clustering algorithms have been shown to be effective in discovering hidden clustering structures in the data matrix. For a data matrix of m rows and n columns, the time complexity of these methods is usually on the order of m × n (if not higher). This limits their applicability to data matrices with a large number of columns and rows. Moreover, existing co-clustering methods implicitly assume that the whole data matrix is held in main memory. In this paper, we propose a general framework, CRD, for co-clustering large datasets using recently developed sampling-based matrix decomposition methods. The time complexity of our approach is linear in m and n, and it does not require the whole data matrix to be in main memory. Experimental results show that CRD achieves accuracy competitive with existing co-clustering methods at much lower computational cost.
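The abstract does not spell out the CRD algorithm, so the following is only a minimal sketch of the general idea it describes: cluster rows using a small sample of columns, and cluster columns using a small sample of rows, so that the work grows linearly with m and n for fixed sample sizes. Everything here is an assumption made for illustration: the function names (sampled_cocluster, kmeans), the uniform sampling (sampling-based decompositions often use non-uniform probabilities instead), the plain k-means step, and the farthest-first initialization. It is not the authors' method.

```python
import random


def _dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain k-means with farthest-first initialization (illustrative only)."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(_dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sampled_cocluster(matrix, k_rows, k_cols, n_cols_sampled, n_rows_sampled, seed=0):
    """Co-cluster using only sampled columns (for rows) and sampled rows (for columns).

    Hypothetical helper: for fixed sample sizes, the number of matrix entries
    read is m * n_cols_sampled + n * n_rows_sampled, which is linear in m and n.
    """
    rng = random.Random(seed)
    m, n = len(matrix), len(matrix[0])
    col_idx = rng.sample(range(n), n_cols_sampled)  # uniform sampling (assumption)
    row_idx = rng.sample(range(m), n_rows_sampled)
    # Each row is described only by its entries in the sampled columns.
    row_features = [[matrix[i][j] for j in col_idx] for i in range(m)]
    # Each column is described only by its entries in the sampled rows.
    col_features = [[matrix[i][j] for i in row_idx] for j in range(n)]
    row_labels = kmeans(row_features, k_rows, rng)
    col_labels = kmeans(col_features, k_cols, rng)
    return row_labels, col_labels


if __name__ == "__main__":
    # Toy matrix with two obvious row/column blocks.
    A = [[1, 1, 1, 0, 0, 0]] * 3 + [[0, 0, 0, 1, 1, 1]] * 3
    rows, cols = sampled_cocluster(A, 2, 2, n_cols_sampled=3, n_rows_sampled=3)
    print("row clusters:", rows)
    print("column clusters:", cols)
```

In this sketch the full matrix is still passed in as a Python list for simplicity; only the sampled rows and columns are actually read, which is the property that would let an out-of-core variant avoid loading the whole matrix.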

Original language: English (US)
Title of host publication: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
Pages: 1337-1339
Number of pages: 3
DOI: https://doi.org/10.1109/ICDE.2008.4497548
State: Published - Oct 1 2008
Event: 2008 IEEE 24th International Conference on Data Engineering, ICDE'08 - Cancun, Mexico
Duration: Apr 7 2008 to Apr 12 2008

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627

Other

Other: 2008 IEEE 24th International Conference on Data Engineering, ICDE'08
Country: Mexico
City: Cancun
Period: 4/7/08 to 4/12/08

Fingerprint

Decomposition
Data storage equipment
Recommender systems
Microarrays
Clustering algorithms
Sampling
Costs

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems

Cite this

Pan, F., Zhang, X., & Wang, W. (2008). A general framework for fast co-clustering on large datasets using matrix decomposition. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08 (pp. 1337-1339). [4497548] (Proceedings - International Conference on Data Engineering). https://doi.org/10.1109/ICDE.2008.4497548
@inproceedings{0b83aa37ef174780af4b4c57fd699266,
title = "A general framework for fast co-clustering on large datasets using matrix decomposition",
abstract = "Simultaneously clustering the columns and rows of a large data matrix (co-clustering) is an important problem with wide applications, such as document mining, microarray analysis, and recommendation systems. Several co-clustering algorithms have been shown to be effective in discovering hidden clustering structures in the data matrix. For a data matrix of m rows and n columns, the time complexity of these methods is usually on the order of m × n (if not higher). This limits their applicability to data matrices with a large number of columns and rows. Moreover, existing co-clustering methods implicitly assume that the whole data matrix is held in main memory. In this paper, we propose a general framework, CRD, for co-clustering large datasets using recently developed sampling-based matrix decomposition methods. The time complexity of our approach is linear in m and n, and it does not require the whole data matrix to be in main memory. Experimental results show that CRD achieves accuracy competitive with existing co-clustering methods at much lower computational cost.",
author = "Feng Pan and Xiang Zhang and Wei Wang",
year = "2008",
month = "10",
day = "1",
doi = "10.1109/ICDE.2008.4497548",
language = "English (US)",
isbn = "9781424418374",
series = "Proceedings - International Conference on Data Engineering",
pages = "1337--1339",
booktitle = "Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08",

}

TY - GEN

T1 - A general framework for fast co-clustering on large datasets using matrix decomposition

AU - Pan, Feng

AU - Zhang, Xiang

AU - Wang, Wei

PY - 2008/10/1

Y1 - 2008/10/1

AB - Simultaneously clustering the columns and rows of a large data matrix (co-clustering) is an important problem with wide applications, such as document mining, microarray analysis, and recommendation systems. Several co-clustering algorithms have been shown to be effective in discovering hidden clustering structures in the data matrix. For a data matrix of m rows and n columns, the time complexity of these methods is usually on the order of m × n (if not higher). This limits their applicability to data matrices with a large number of columns and rows. Moreover, existing co-clustering methods implicitly assume that the whole data matrix is held in main memory. In this paper, we propose a general framework, CRD, for co-clustering large datasets using recently developed sampling-based matrix decomposition methods. The time complexity of our approach is linear in m and n, and it does not require the whole data matrix to be in main memory. Experimental results show that CRD achieves accuracy competitive with existing co-clustering methods at much lower computational cost.

UR - http://www.scopus.com/inward/record.url?scp=52649158129&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=52649158129&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2008.4497548

DO - 10.1109/ICDE.2008.4497548

M3 - Conference contribution

SN - 9781424418374

T3 - Proceedings - International Conference on Data Engineering

SP - 1337

EP - 1339

BT - Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE'08

ER -
