CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset

Jian Wu, Bharath Kandimalla, Shaurya Rohatgi, Athar Sefid, Jianyu Mao, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Scopus citations

Abstract

We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences.

Original languageEnglish (US)
Title of host publicationProceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
EditorsYang Song, Bing Liu, Kisung Lee, Naoki Abe, Calton Pu, Mu Qiao, Nesreen Ahmed, Donald Kossmann, Jeffrey Saltz, Jiliang Tang, Jingrui He, Huan Liu, Xiaohua Hu
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages5465-5467
Number of pages3
ISBN (Electronic)9781538650356
DOIs
StatePublished - Jan 22 2019
Event2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: Dec 10 2018Dec 13 2018

Publication series

NameProceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference2018 IEEE International Conference on Big Data, Big Data 2018
CountryUnited States
CitySeattle
Period12/10/1812/13/18

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Information Systems

Fingerprint Dive into the research topics of 'CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset'. Together they form a unique fingerprint.

Cite this