Distributed entity resolution based on similarity join for large-scale data clustering

Tiezheng Nie, Wang-chien Lee, Derong Shen, Ge Yu, Yue Kou

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

1 Scopus citation

Abstract

Entity resolution has been widely used in data mining applications to find similar records. However, the increasing scale and complexity of data have restricted the performance of entity resolution. In this paper, we propose a novel entity resolution framework that clusters large-scale data with a distributed entity resolution method. We model the clustering problem as finding connected subgraphs in a similarity graph over the records. First, our approach finds pairs of records whose similarities are above a given threshold using the appjoin algorithm, which extends the ppjoin algorithm and is executed on the MapReduce framework. Then, we propose a cache-based algorithm that clusters entities from the similar pairs; it is based on the Disjoint Set algorithm and is also designed for the MapReduce framework. Experimental results on a real dataset show that our algorithms are more efficient than previous algorithms for entity resolution and clustering.
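
The two stages described in the abstract (a thresholded similarity join, then clustering of the similar pairs into connected components with a disjoint set) can be illustrated with a small single-machine sketch. This is not the authors' appjoin or their cache-based MapReduce algorithm: the all-pairs join below stands in for a ppjoin-style join, Jaccard similarity over whitespace tokens is an assumed similarity measure, and all function and class names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(records, threshold):
    """Naive all-pairs join; a stand-in for a ppjoin-style similarity join."""
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    return [
        (r1, r2)
        for r1, r2 in combinations(sorted(tokens), 2)
        if jaccard(tokens[r1], tokens[r2]) > threshold
    ]


class DisjointSet:
    """Union-find structure used to merge records linked by similar pairs."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster(records, threshold):
    """Group record ids into connected components of the similarity graph."""
    ds = DisjointSet(records)
    for r1, r2 in similar_pairs(records, threshold):
        ds.union(r1, r2)
    groups = {}
    for rid in records:
        groups.setdefault(ds.find(rid), set()).add(rid)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    records = {
        1: "john smith new york",
        2: "john smith new york city",
        3: "jon smith nyc",
        4: "acme corp boston",
        5: "acme corp boston ma",
    }
    print(cluster(records, threshold=0.5))
```

Because clusters are connected components, two records can land in the same cluster without being directly similar, as long as a chain of similar pairs links them.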

Original language: English (US)
Title of host publication: Web-Age Information Management - 15th International Conference, WAIM 2014, Proceedings
Publisher: Springer Verlag
Pages: 138-149
Number of pages: 12
ISBN (Print): 9783319080093
DOIs: https://doi.org/10.1007/978-3-319-08010-9_16
State: Published - Jan 1 2014
Event: 15th International Conference on Web-Age Information Management, WAIM 2014 - Macau, China
Duration: Jun 16 2014 to Jun 18 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8485 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th International Conference on Web-Age Information Management, WAIM 2014
Country: China
City: Macau
Period: 6/16/14 to 6/18/14

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

  • Cite this

    Nie, T., Lee, W., Shen, D., Yu, G., & Kou, Y. (2014). Distributed entity resolution based on similarity join for large-scale data clustering. In Web-Age Information Management - 15th International Conference, WAIM 2014, Proceedings (pp. 138-149). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 8485 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-08010-9_16