A web service for scholarly big data information extraction

Kyle Williams, Lichi Li, Madian Khabsa, Jian Wu, Patrick C. Shih, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

14 Scopus citations

Abstract

The automatic extraction of metadata and other information from scholarly documents is a common task in academic digital libraries, search engines, and document management systems to allow for the management and categorization of documents and for search to take place. A Web-accessible API can simplify this extraction by providing a single point of operation for extraction that can be incorporated into multiple document workflows without the need for each workflow to implement and support its own extraction functionality. In this paper, we describe CiteSeerExtractor, a RESTful API for scholarly information extraction that exploits the fact that there is duplication in scholarly big data and makes use of a near duplicate matching backend. The backend stores previously extracted metadata and avoids extracting metadata from a document if it has already been extracted before. We describe the design, implementation, and functionality of CiteSeerExtractor and show how the duplicate document matching results in a difference of 8.46% in the time required to extract header and citation information from approximately 3.5 million documents compared to a baseline.
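
The duplicate-aware backend described in the abstract can be illustrated with a minimal sketch. The abstract does not say how CiteSeerExtractor matches near duplicates, so the code below substitutes a simple stand-in: Jaccard similarity over word shingles, compared against stored documents with a linear scan. All names (`DuplicateAwareExtractor`, `shingles`, `jaccard`, the `extract` callback, and the `threshold` value) are hypothetical and do not come from the paper. A linear scan would not scale to millions of documents; it is used here only to keep the idea visible.

```python
import re
from typing import Callable, List, Tuple


def shingles(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle holding all their words.
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two shingle sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class DuplicateAwareExtractor:
    """Hypothetical wrapper that skips extraction for near-duplicate input.

    `extract` stands in for any expensive header/citation extractor.
    """

    def __init__(self, extract: Callable[[str], dict], threshold: float = 0.9):
        self.extract = extract
        self.threshold = threshold  # assumed cut-off, not from the paper
        self.store: List[Tuple[frozenset, dict]] = []
        self.hits = 0
        self.misses = 0

    def get_metadata(self, text: str) -> dict:
        signature = shingles(text)
        # Reuse stored metadata if a sufficiently similar document exists.
        for stored_signature, metadata in self.store:
            if jaccard(signature, stored_signature) >= self.threshold:
                self.hits += 1
                return metadata
        # Otherwise run the extractor once and remember the result.
        self.misses += 1
        metadata = self.extract(text)
        self.store.append((signature, metadata))
        return metadata
```

In this sketch, every hit avoids one call to the extractor. That is the mechanism the abstract credits for the change in total extraction time, although the 8.46% figure itself depends on the real system and data set.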

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Conference on Web Services, ICWS 2014
Editors: David De Roure, Bhavani Thuraisingham, Jia Zhang
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 105-112
Number of pages: 8
ISBN (Electronic): 9781479950546
State: Published - 2014
Event: 2014 21st IEEE International Conference on Web Services, ICWS 2014 - Anchorage, United States
Duration: Jun 27 2014 to Jul 2 2014

Publication series

Name: Proceedings - 2014 IEEE International Conference on Web Services, ICWS 2014

Other

Other: 2014 21st IEEE International Conference on Web Services, ICWS 2014
Country: United States
City: Anchorage
Period: 6/27/14 to 7/2/14

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Human-Computer Interaction
  • Software
