CiteSeerX: 20 Years of service to scholarly big data

Jian Wu, Kunho Kim, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We give an overview of CiteSeerX, the pioneering digital library search engine that has served academic communities for more than 20 years (first released in 1998), from three perspectives. The system perspective summarizes the evolution of its architecture in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. To be scalable and effective, CiteSeerX employs AI technologies in all essential modules. To train the underlying models effectively, a sufficient amount of data has been labeled, and this labeled data can be reused to train future models. Finally, we discuss the future of CiteSeerX. Our ongoing work is to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated to number 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will then have more opportunities to explore and trace reusable datasets and to discover other datasets for new research projects. We also summarize the lessons learned that could make a similar system more sustainable and useful.

Original language: English (US)
Title of host publication: Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450371841
DOI: https://doi.org/10.1145/3359115.3359119
State: Published - May 13 2019
Event: 2019 Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019 - Pittsburgh, United States
Duration: May 13 2019 to May 15 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2019 Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019
Country: United States
City: Pittsburgh
Period: 5/13/19 to 5/15/19

All Science Journal Classification (ASJC) codes

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications


  • Cite this

    Wu, J., Kim, K., & Giles, C. L. (2019). CiteSeerX: 20 Years of service to scholarly big data. In Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019 [3359119] (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3359115.3359119