An architecture for information extraction from figures in digital libraries

Sagnik Ray Choudhury, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Scopus citations

Abstract

Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data that is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. an extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. a search engine on the extracted figures and metadata; 3. an image processing module for automated data extraction from the figures; and 4. a natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, and report an extractor algorithm to extract vector graphics from scholarly documents and a classification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10%, and the classification process is very scalable, yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures, which is integrated with our search engine to improve user experience.
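As a rough illustration of how the output of module 1 (figures with captions and mentions) could feed module 2 (search), the following is a minimal sketch and not the authors' implementation. The names `Figure`, `FigureIndex`, and `tokenize` are hypothetical, and a toy in-memory inverted index stands in for a real search engine; modules 3 and 4 are not covered.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Figure:
    """Hypothetical record for one extracted figure and its metadata."""
    figure_id: str
    caption: str
    mentions: list[str] = field(default_factory=list)


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


class FigureIndex:
    """Toy inverted index over captions and mentions (assumed stand-in
    for the search module described in the abstract)."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)
        self._figures: dict[str, Figure] = {}

    def add(self, figure: Figure) -> None:
        """Index every token of the caption and of each mention."""
        self._figures[figure.figure_id] = figure
        for text in [figure.caption, *figure.mentions]:
            for token in tokenize(text):
                self._postings[token].add(figure.figure_id)

    def search(self, query: str) -> list[str]:
        """Return sorted ids of figures whose metadata has every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)
```

A query such as `FigureIndex.search("precision recall")` would match any figure whose caption or mentions together contain both words, which is the kind of metadata-driven lookup the abstract attributes to the search module.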

Original language: English (US)
Title of host publication: WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
Publisher: Association for Computing Machinery, Inc
Pages: 667-672
Number of pages: 6
ISBN (Electronic): 9781450334730
State: Published - May 18 2015
Event: 24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration: May 18 2015 to May 22 2015

Publication series

Name: WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web

Other

Other: 24th International Conference on World Wide Web, WWW 2015
Country: Italy
City: Florence
Period: 5/18/15 to 5/22/15

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software
