Figure metadata extraction from digital documents

Sagnik Ray Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, C. Lee Giles

Research output: Contribution to journal › Conference article

25 Scopus citations

Abstract

Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics are not straightforward, and both are well-studied problems in document analysis \cite{4275059}. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classifying, and extracting data from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, in both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.
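As a rough illustration of the rule-based side of caption extraction, the sketch below separates caption lines from in-text figure mentions using only lexical cues. It is a minimal example under stated assumptions, not the extractor described in the abstract: the regular expressions, the function name `extract_captions_and_mentions`, and the assumption that the PDF text is already available as a list of lines are all hypothetical, and the layout and font features mentioned above are not modeled.

```python
import re

# Hypothetical lexical rule: a caption line starts with "Figure" or "Fig."
# followed by a number and a "." or ":" delimiter.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.?)\s*(\d+)\s*[.:]\s*(.+)$", re.IGNORECASE)

# Hypothetical lexical rule: a mention is any other reference to "Figure N" or "Fig. N".
MENTION_RE = re.compile(r"\b(?:Figure|Fig\.?)\s*(\d+)\b", re.IGNORECASE)


def extract_captions_and_mentions(lines):
    """Return (captions, mentions) keyed by figure number.

    captions maps a figure number to its caption text.
    mentions maps a figure number to the list of lines that refer to it.
    """
    captions = {}
    mentions = {}
    for line in lines:
        caption_match = CAPTION_RE.match(line)
        if caption_match:
            number = int(caption_match.group(1))
            captions[number] = caption_match.group(2).strip()
            continue  # a caption line is not also counted as a mention
        for number_text in MENTION_RE.findall(line):
            mentions.setdefault(int(number_text), []).append(line.strip())
    return captions, mentions


if __name__ == "__main__":
    sample = [
        "Figure 1: Precision versus recall.",
        "As shown in Fig. 1, recall improves.",
        "Unrelated text.",
    ]
    print(extract_captions_and_mentions(sample))
```

Purely lexical rules like these would misfire on sentences that happen to begin with "Figure 1." in body text, which is one reason layout and font cues, or a learned classifier, are useful in practice.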

Original language: English (US)
Article number: 6628599
Pages (from-to): 135-139
Number of pages: 5
Journal: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
State: Published - Dec 11 2013
Event: 12th International Conference on Document Analysis and Recognition, ICDAR 2013 - Washington, DC, United States
Duration: Aug 25 2013 to Aug 28 2013

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
