Automatic extraction of figures from scholarly documents

Sagnik Ray Choudhury, Prasenjit Mitra, Clyde Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts, and other images that are generated manually to symbolically represent and illustrate visually important concepts, findings, and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large-scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of building a heuristic-independent, trainable model for such an extraction task and of extracting figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.
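
The abstract names figure-precision, figure-recall, and figure-F1-score but does not define them here. The sketch below shows one plausible reading, which is an assumption and not the paper's definition: a predicted figure counts as correct when its bounding box overlaps a manually tagged box with intersection-over-union at or above a threshold. The `Box` layout, the function names `iou` and `figure_scores`, the greedy matching, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def figure_scores(
    predicted: List[Box], tagged: List[Box], threshold: float = 0.8
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under an assumed IoU matching rule.

    Each tagged box can be matched by at most one predicted box; the
    matching is greedy, which is a simplification.
    """
    unmatched = list(tagged)
    true_pos = 0
    for p in predicted:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(tagged) if tagged else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy boxes invented for this example, not data from the paper.
predicted_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
tagged_boxes = [(0, 0, 10, 10), (100, 100, 110, 110), (200, 200, 210, 210)]
precision, recall, f1 = figure_scores(predicted_boxes, tagged_boxes)
print(precision, recall, f1)
```

Here one of two predictions matches one of three tagged figures, so precision is 1/2 and recall is 1/3. F1 is the harmonic mean of the two regardless of how a "match" is defined, so only the matching rule would need to change to follow the paper's actual definition.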

Original language: English (US)
Title of host publication: DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 47-50
Number of pages: 4
ISBN (Electronic): 9781450333078
DOI: https://doi.org/10.1145/2682571.2797085
State: Published - Sep 8 2015
Event: ACM Symposium on Document Engineering, DocEng 2015 - Lausanne, Switzerland
Duration: Sep 8 2015 to Sep 11 2015

Publication series

Name: DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

Other

Other: ACM Symposium on Document Engineering, DocEng 2015
Country: Switzerland
City: Lausanne
Period: 9/8/15 to 9/11/15

Fingerprint

Semantics

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Software

Cite this

Choudhury, S. R., Mitra, P., & Giles, C. L. (2015). Automatic extraction of figures from scholarly documents. In DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 47-50). (DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering). Association for Computing Machinery, Inc. https://doi.org/10.1145/2682571.2797085
Choudhury, Sagnik Ray ; Mitra, Prasenjit ; Giles, Clyde Lee. / Automatic extraction of figures from scholarly documents. DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, 2015. pp. 47-50 (DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering).
@inproceedings{60ae528ef9104ab0a39490d69a749586,
title = "Automatic extraction of figures from scholarly documents",
abstract = "Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple {"}figures{"} such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80{\%}.",
author = "Choudhury, {Sagnik Ray} and Prasenjit Mitra and Giles, {Clyde Lee}",
year = "2015",
month = "9",
day = "8",
doi = "10.1145/2682571.2797085",
language = "English (US)",
series = "DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering",
publisher = "Association for Computing Machinery, Inc",
pages = "47--50",
booktitle = "DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering",

}

Choudhury, SR, Mitra, P & Giles, CL 2015, Automatic extraction of figures from scholarly documents. in DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering, Association for Computing Machinery, Inc, pp. 47-50, ACM Symposium on Document Engineering, DocEng 2015, Lausanne, Switzerland, 9/8/15. https://doi.org/10.1145/2682571.2797085

Automatic extraction of figures from scholarly documents. / Choudhury, Sagnik Ray; Mitra, Prasenjit; Giles, Clyde Lee.

DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc, 2015. p. 47-50 (DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering).

TY - GEN

T1 - Automatic extraction of figures from scholarly documents

AU - Choudhury, Sagnik Ray

AU - Mitra, Prasenjit

AU - Giles, Clyde Lee

PY - 2015/9/8

Y1 - 2015/9/8

AB - Scholarly papers (journal and conference papers, technical reports, etc.) usually contain multiple "figures" such as plots, flow charts and other images which are generated manually to symbolically represent and illustrate visually important concepts, findings and results. These figures can be analyzed for automated data extraction or semantic analysis. Surprisingly, large scale automated extraction of such figures from PDF documents has received little attention. Here we discuss the challenges of how to build a heuristic independent trainable model for such an extraction task and how to extract figures at scale. Motivated by recent developments in table extraction, we define three new evaluation metrics: figure-precision, figure-recall, and figure-F1-score. Our dataset consists of a sample of 200 PDFs, randomly collected from five million scholarly PDFs and manually tagged for 180 figure locations. Initial results from our work demonstrate an accuracy greater than 80%.

UR - http://www.scopus.com/inward/record.url?scp=84959235832&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959235832&partnerID=8YFLogxK

U2 - 10.1145/2682571.2797085

DO - 10.1145/2682571.2797085

M3 - Conference contribution

AN - SCOPUS:84959235832

T3 - DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

SP - 47

EP - 50

BT - DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering

PB - Association for Computing Machinery, Inc

ER -

Choudhury SR, Mitra P, Giles CL. Automatic extraction of figures from scholarly documents. In DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering. Association for Computing Machinery, Inc. 2015. p. 47-50. (DocEng 2015 - Proceedings of the 2015 ACM Symposium on Document Engineering). https://doi.org/10.1145/2682571.2797085