CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset

Jian Wu, Bharath Kandimalla, Shaurya Rohatgi, Athar Sefid, Jianyu Mao, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences.

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
Editors: Yang Song, Bing Liu, Kisung Lee, Naoki Abe, Calton Pu, Mu Qiao, Nesreen Ahmed, Donald Kossmann, Jeffrey Saltz, Jiliang Tang, Jingrui He, Huan Liu, Xiaohua Hu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5465-5467
Number of pages: 3
ISBN (Electronic): 9781538650356
DOIs: https://doi.org/10.1109/BigData.2018.8622114
State: Published - Jan 22 2019
Event: 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: Dec 10 2018 to Dec 13 2018

Publication series

Name: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference: 2018 IEEE International Conference on Big Data, Big Data 2018
Country: United States
City: Seattle
Period: 12/10/18 to 12/13/18

Fingerprint

Information science
Metadata
Computer science
Learning systems
Neural networks

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Information Systems

Cite this

Wu, J., Kandimalla, B., Rohatgi, S., Sefid, A., Mao, J., & Giles, C. L. (2019). CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset. In Y. Song, B. Liu, K. Lee, N. Abe, C. Pu, M. Qiao, N. Ahmed, D. Kossmann, J. Saltz, J. Tang, J. He, H. Liu, ... X. Hu (Eds.), Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018 (pp. 5465-5467). [8622114] (Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2018.8622114
@inproceedings{c2cef66d1bd54f42b61a6daaf814a6d2,
title = "CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset",
abstract = "We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences.",
author = "Jian Wu and Bharath Kandimalla and Shaurya Rohatgi and Athar Sefid and Jianyu Mao and Giles, {C. Lee}",
year = "2019",
month = "1",
day = "22",
doi = "10.1109/BigData.2018.8622114",
language = "English (US)",
series = "Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "5465--5467",
editor = "Yang Song and Bing Liu and Kisung Lee and Naoki Abe and Calton Pu and Mu Qiao and Nesreen Ahmed and Donald Kossmann and Jeffrey Saltz and Jiliang Tang and Jingrui He and Huan Liu and Xiaohua Hu",
booktitle = "Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018",
address = "United States",

}

TY  - GEN
T1  - CiteSeerX-2018
T2  - A Cleansed Multidisciplinary Scholarly Big Dataset
AU  - Wu, Jian
AU  - Kandimalla, Bharath
AU  - Rohatgi, Shaurya
AU  - Sefid, Athar
AU  - Mao, Jianyu
AU  - Giles, C. Lee
PY  - 2019/1/22
Y1  - 2019/1/22
AB  - We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences.
UR  - http://www.scopus.com/inward/record.url?scp=85062620330&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85062620330&partnerID=8YFLogxK
U2  - 10.1109/BigData.2018.8622114
DO  - 10.1109/BigData.2018.8622114
M3  - Conference contribution
AN  - SCOPUS:85062620330
T3  - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
SP  - 5465
EP  - 5467
BT  - Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
A2  - Song, Yang
A2  - Liu, Bing
A2  - Lee, Kisung
A2  - Abe, Naoki
A2  - Pu, Calton
A2  - Qiao, Mu
A2  - Ahmed, Nesreen
A2  - Kossmann, Donald
A2  - Saltz, Jeffrey
A2  - Tang, Jiliang
A2  - He, Jingrui
A2  - Liu, Huan
A2  - Hu, Xiaohua
PB  - Institute of Electrical and Electronics Engineers Inc.
ER  -
