TY - GEN
T1 - CiteSeerX
T2 - 2019 Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019
AU - Wu, Jian
AU - Kim, Kunho
AU - Giles, C. Lee
N1 - Funding Information:
We gratefully acknowledge partial support from the National Science Foundation and thank the reviewers for constructive comments.
Publisher Copyright:
© 2019 held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/5/13
Y1 - 2019/5/13
N2 - We give an overview of CiteSeerX, the pioneering digital library search engine that has served academic communities for more than 20 years (first released in 1998), from three perspectives. The system perspective summarizes the evolution of its architecture in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. To make the system scalable and effective, AI technologies are employed in all essential modules. To train these models effectively, a sufficient amount of data has been labeled, and this labeled data can be reused to train future models. Finally, we discuss the future of CiteSeerX. Our ongoing work is to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated at 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will have more opportunities to explore and trace reusable datasets and to discover other datasets for new research projects. We summarize the lessons learned that could make similar systems more sustainable and useful.
UR - http://www.scopus.com/inward/record.url?scp=85073801149&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073801149&partnerID=8YFLogxK
U2 - 10.1145/3359115.3359119
DO - 10.1145/3359115.3359119
M3 - Conference contribution
AN - SCOPUS:85073801149
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the Conference on Artificial Intelligence for Data Discovery and Reuse, AIDR 2019
PB - Association for Computing Machinery
Y2 - 13 May 2019 through 15 May 2019
ER -