Linked document embedding for classification

Suhang Wang, Jiliang Tang, Charu Aggarwal, Huan Liu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

37 Citations (Scopus)

Abstract

Word and document embedding algorithms such as Skip-gram and Paragraph Vector have been proven to help various text analysis tasks such as document classification, document clustering and information retrieval. The vast majority of these algorithms are designed to work with independent and identically distributed documents. However, in many real-world applications, documents are inherently linked. For example, web documents such as blogs and online news often have hyperlinks to other web documents, and scientific articles usually cite other articles. Linked documents present new challenges to traditional document embedding algorithms. In addition, most existing document embedding algorithms are unsupervised and their learned representations may not be optimal for classification when labeling information is available. In this paper, we study the problem of linked document embedding for classification and propose a linked document embedding framework LDE, which combines link and label information with content information to learn document representations for classification. Experimental results on real-world datasets demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the importance of link and label information in the proposed framework LDE.

Original language: English (US)
Title of host publication: CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 115-124
Number of pages: 10
ISBN (Electronic): 9781450340731
DOI: https://doi.org/10.1145/2983323.2983755
State: Published - Oct 24 2016
Event: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016 - Indianapolis, United States
Duration: Oct 24 2016 to Oct 28 2016

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings
Volume: 24-28-October-2016

Other

Other: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Country: United States
City: Indianapolis
Period: 10/24/16 to 10/28/16

Fingerprint

World Wide Web
Text analysis
Information content
Document clustering
Document classification
Blogs
Labeling
News
Information retrieval
Experiment

All Science Journal Classification (ASJC) codes

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Wang, S., Tang, J., Aggarwal, C., & Liu, H. (2016). Linked document embedding for classification. In CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management (pp. 115-124). (International Conference on Information and Knowledge Management, Proceedings; Vol. 24-28-October-2016). Association for Computing Machinery. https://doi.org/10.1145/2983323.2983755
@inproceedings{da9025aab6684558831ad0039533bbfb,
title = "Linked document embedding for classification",
author = "Suhang Wang and Jiliang Tang and Charu Aggarwal and Huan Liu",
year = "2016",
month = "10",
day = "24",
doi = "10.1145/2983323.2983755",
language = "English (US)",
series = "International Conference on Information and Knowledge Management, Proceedings",
publisher = "Association for Computing Machinery",
pages = "115--124",
booktitle = "CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management",

}



TY - GEN
T1 - Linked document embedding for classification
AU - Wang, Suhang
AU - Tang, Jiliang
AU - Aggarwal, Charu
AU - Liu, Huan
PY - 2016/10/24
Y1 - 2016/10/24
UR - http://www.scopus.com/inward/record.url?scp=84996497473&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84996497473&partnerID=8YFLogxK
U2 - 10.1145/2983323.2983755
DO - 10.1145/2983323.2983755
M3 - Conference contribution
AN - SCOPUS:84996497473
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 115
EP - 124
BT - CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
PB - Association for Computing Machinery
ER -
