CiteSeerx: A scholarly big dataset

Cornelia Caragea, Jian Wu, Alina Ciobanu, Kyle Williams, Juan Fernández-Ramírez, Hung Hsuan Chen, Zhaohui Wu, Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

36 Scopus citations

Abstract

The CiteSeerx digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeerx has proven to be a powerful resource in many data mining, machine learning and information retrieval applications that use rich metadata, e.g., titles, abstracts, authors, venues, and reference lists. The metadata extraction in CiteSeerx is done using automated techniques. Although fairly accurate, these techniques still result in noisy metadata. Since the performance of models trained on these data depends heavily on the quality of the data, we propose an approach to CiteSeerx metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeerx that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.

Original language: English (US)
Title of host publication: Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Proceedings
Publisher: Springer Verlag
Pages: 311-322
Number of pages: 12
ISBN (Print): 9783319060279
State: Published - Jan 1 2014
Event: 36th European Conference on Information Retrieval, ECIR 2014 - Amsterdam, Netherlands
Duration: Apr 13 2014 to Apr 16 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8416 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 36th European Conference on Information Retrieval, ECIR 2014
Country/Territory: Netherlands
City: Amsterdam
Period: 4/13/14 to 4/16/14

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
