TY - GEN
T1 - An architecture for information extraction from figures in digital libraries
AU - Choudhury, Sagnik Ray
AU - Giles, C. Lee
PY - 2015/5/18
Y1 - 2015/5/18
N2 - Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data which is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. An extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. A search engine on the extracted figures and metadata; 3. An image processing module for automated data extraction from the figures; and 4. A natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, and report an extractor algorithm to extract vector graphics from scholarly documents and a classification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10%, and the classification process is very scalable, yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures which is integrated with our search engine to improve user experience.
AB - Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data which is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. An extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. A search engine on the extracted figures and metadata; 3. An image processing module for automated data extraction from the figures; and 4. A natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, and report an extractor algorithm to extract vector graphics from scholarly documents and a classification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10%, and the classification process is very scalable, yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures which is integrated with our search engine to improve user experience.
UR - http://www.scopus.com/inward/record.url?scp=84944347856&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84944347856&partnerID=8YFLogxK
U2 - 10.1145/2740908.2741712
DO - 10.1145/2740908.2741712
M3 - Conference contribution
AN - SCOPUS:84944347856
T3 - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
SP - 667
EP - 672
BT - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
PB - Association for Computing Machinery, Inc
T2 - 24th International Conference on World Wide Web, WWW 2015
Y2 - 18 May 2015 through 22 May 2015
ER -