CiteSeerX: 20 Years of service to scholarly big data

Jian Wu, Kunho Kim, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We overview CiteSeerX, the pioneer digital library search engine, that has been serving academic communities for more than 20 years (first released in 1998), from three perspectives. The system perspective summarizes its architecture evolution in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. In order to be scalable and effective, AI technologies are employed in all essential modules. To effectively train these models, a sufficient amount of data has been labeled, which can then be reused for training future models. Finally, we discuss the future of CiteSeerX. Our ongoing work is to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated to be 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will have more opportunities to explore and trace datasets that can be reused and discover other datasets for new research projects. We summarize what was learned to make a similar system more sustainable and useful.

Original language: English (US)
Title of host publication: Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450371841
DOI: https://doi.org/10.1145/3359115.3359119
State: Published - May 13 2019
Event: 2019 Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019 - Pittsburgh, United States
Duration: May 13 2019 to May 15 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 2019 Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019
Country: United States
City: Pittsburgh
Period: 5/13/19 to 5/15/19

Fingerprint

Digital libraries
Search engines
Metadata
Big data

All Science Journal Classification (ASJC) codes

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Wu, J., Kim, K., & Giles, C. L. (2019). CiteSeerX: 20 Years of service to scholarly big data. In Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019 [3359119] (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3359115.3359119
Wu, Jian; Kim, Kunho; Giles, C. Lee / CiteSeerX: 20 Years of service to scholarly big data. Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019. Association for Computing Machinery, 2019. (ACM International Conference Proceeding Series).
@inproceedings{2569419958214323b96773a2c996f55f,
title = "CiteSeerX: 20 Years of service to scholarly big data",
abstract = "We overview CiteSeerX, the pioneer digital library search engine, that has been serving academic communities for more than 20 years (first released in 1998), from three perspectives. The system perspective summarizes its architecture evolution in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. In order to be scalable and effective, AI technologies are employed in all essential modules. To effectively train these models, a sufficient amount of data has been labeled, which can then be reused for training future models. Finally, we discuss the future of CiteSeerX. Our ongoing work is to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated to be 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will have more opportunities to explore and trace datasets that can be reused and discover other datasets for new research projects. We summarize what was learned to make a similar system more sustainable and useful.",
author = "Jian Wu and Kunho Kim and C. Lee Giles",
year = "2019",
month = "5",
day = "13",
doi = "10.1145/3359115.3359119",
language = "English (US)",
series = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery",
booktitle = "Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019",

}

Wu, J, Kim, K & Giles, CL 2019, CiteSeerX: 20 Years of service to scholarly big data. in Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019, 3359119, ACM International Conference Proceeding Series, Association for Computing Machinery, 2019 Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019, Pittsburgh, United States, 5/13/19. https://doi.org/10.1145/3359115.3359119

CiteSeerX: 20 Years of service to scholarly big data. / Wu, Jian; Kim, Kunho; Giles, C. Lee.

Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019. Association for Computing Machinery, 2019. 3359119 (ACM International Conference Proceeding Series).

TY - GEN

T1 - CiteSeerX

T2 - 20 Years of service to scholarly big data

AU - Wu, Jian

AU - Kim, Kunho

AU - Giles, C. Lee

PY - 2019/5/13

Y1 - 2019/5/13

N2 - We overview CiteSeerX, the pioneer digital library search engine, that has been serving academic communities for more than 20 years (first released in 1998), from three perspectives. The system perspective summarizes its architecture evolution in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. In order to be scalable and effective, AI technologies are employed in all essential modules. To effectively train these models, a sufficient amount of data has been labeled, which can then be reused for training future models. Finally, we discuss the future of CiteSeerX. Our ongoing work is to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated to be 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will have more opportunities to explore and trace datasets that can be reused and discover other datasets for new research projects. We summarize what was learned to make a similar system more sustainable and useful.

UR - http://www.scopus.com/inward/record.url?scp=85073801149&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85073801149&partnerID=8YFLogxK

U2 - 10.1145/3359115.3359119

DO - 10.1145/3359115.3359119

M3 - Conference contribution

AN - SCOPUS:85073801149

T3 - ACM International Conference Proceeding Series

BT - Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019

PB - Association for Computing Machinery

ER -

Wu J, Kim K, Giles CL. CiteSeerX: 20 Years of service to scholarly big data. In Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019. Association for Computing Machinery. 2019. 3359119. (ACM International Conference Proceeding Series). https://doi.org/10.1145/3359115.3359119