Automatic extraction of figures from scholarly documents

Sagnik Ray Choudhury, Prasenjit Mitra, Clyde Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Scopus citations

Abstract

Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts, and other images, which are generated manually to symbolically represent and visually illustrate important concepts, findings, and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of building a heuristic-independent, trainable model for this extraction task and of extracting figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
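The abstract names figure-precision, figure-recall, and figure-F1-score but does not define them here. As a rough illustration only, the sketch below assumes a predicted figure counts as correct when its bounding box overlaps a not-yet-matched ground-truth box with intersection-over-union (IoU) at or above a threshold. The IoU criterion, the 0.8 default threshold, the greedy matching, and the names `iou` and `figure_scores` are all assumptions made for this example, not the authors' definitions.

```python
from typing import List, Tuple

# Assumed box format: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def figure_scores(
    predicted: List[Box], truth: List[Box], threshold: float = 0.8
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the assumed IoU matching rule."""
    unmatched = list(truth)  # ground-truth boxes not yet claimed
    true_positives = 0
    for box in predicted:
        # Greedily pick the best-overlapping ground-truth box still available.
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)  # each true figure is matched at most once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1
```

With two tagged figures and three predictions of which one matches, this sketch gives precision 1/3, recall 1/2, and F1 0.4. A different matching rule (for example, area-based partial credit) would yield different numbers.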

Original language: English (US)
Title of host publication: DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 47-50
Number of pages: 4
ISBN (Electronic): 9781450333078
State: Published - Sep 8 2015
Event: ACM Symposium on Document Engineering, DocEng 2015 - Lausanne, Switzerland
Duration: Sep 8 2015 → Sep 11 2015

Publication series

Name: DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

Other

Other: ACM Symposium on Document Engineering, DocEng 2015
Country: Switzerland
City: Lausanne
Period: 9/8/15 → 9/11/15

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Software

