PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search

Jian Wu, Jason Killian, Huaiyu Yang, Kyle Williams, Sagnik Ray Choudhury, Suppawong Tuarob, Cornelia Caragea, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. Currently, it leverages PDFBox and TET for full text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and algorithm extraction [27]. While it can be run as a whole, the extraction tool in each module is highly customizable. Users can substitute default extractors with other extraction tools they prefer by writing a thin wrapper to implement the abstracts. The framework is designed to be scalable and is capable of running in parallel using a multi-processing technique in Python. Experiments indicate that the system with default setups is CPU bounded, and leaves a small footprint in the memory, which makes it best to run on a multi-core machine. The best performance using a dedicated server of 16 cores takes 1.3 seconds on average to process one PDF document. It is used to index extracted information and help users to quickly locate relevant results in published scholarly documents and to efficiently construct a large knowledge base in order to build a semantic scholarly search engine. Part of it is running on CiteSeerX digital library search engine.
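The wrapper-plus-multiprocessing design described in the abstract can be illustrated with a minimal Python sketch. This is not PDFMEF's actual code: the names `Extractor`, `FilenameHeaderExtractor`, `EXTRACTORS`, `process_document`, and `process_batch` are hypothetical, and the toy extractor only derives a title from the file name where a real wrapper would call a tool such as GROBID.

```python
from abc import ABC, abstractmethod
from multiprocessing import Pool


class Extractor(ABC):
    """Hypothetical abstract interface that every wrapped tool implements."""

    @abstractmethod
    def extract(self, pdf_path):
        """Return the extracted fields for one PDF as a dictionary."""


class FilenameHeaderExtractor(Extractor):
    """Toy stand-in for a thin wrapper around a real header extractor."""

    def extract(self, pdf_path):
        # A real wrapper would invoke the external tool here.
        # This toy version just uses the file name as the title.
        name = pdf_path.rsplit("/", 1)[-1]
        if name.endswith(".pdf"):
            name = name[:-4]
        return {"title": name}


# One extractor per module; swapping a tool means replacing one entry
# with another Extractor subclass (assumed registry, for illustration).
EXTRACTORS = {"header": FilenameHeaderExtractor()}


def process_document(pdf_path):
    """Run every registered extractor on a single document."""
    record = {"path": pdf_path}
    for module, extractor in EXTRACTORS.items():
        record[module] = extractor.extract(pdf_path)
    return record


def process_batch(pdf_paths, workers=2):
    """Distribute documents across worker processes, one document per task."""
    with Pool(processes=workers) as pool:
        return pool.map(process_document, pdf_paths)


if __name__ == "__main__":
    for record in process_batch(["papers/a.pdf", "papers/b.pdf"]):
        print(record)
```

Because each document is processed independently, a CPU-bound workload of this shape would be expected to scale with the number of cores, which is consistent with the abstract's recommendation of a multi-core machine.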

Original language: English (US)
Title of host publication: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450338493
DOI: https://doi.org/10.1145/2815833.2815834
State: Published - Oct 7 2015
Event: 8th International Conference on Knowledge Capture, K-CAP 2015 - Palisades, United States
Duration: Oct 7 2015 → Oct 10 2015

Publication series

Name: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015

Other

Other: 8th International Conference on Knowledge Capture, K-CAP 2015
Country: United States
City: Palisades
Period: 10/7/15 → 10/10/15

Fingerprint

Semantics
Search engines
Digital libraries
Program processors
Servers
Data storage equipment
Processing
Experiments

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Information Systems
  • Software

Cite this

Wu, J., Killian, J., Yang, H., Williams, K., Choudhury, S. R., Tuarob, S., ... Giles, C. L. (2015). PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015 [13] (Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015). Association for Computing Machinery, Inc. https://doi.org/10.1145/2815833.2815834
Wu, Jian ; Killian, Jason ; Yang, Huaiyu ; Williams, Kyle ; Choudhury, Sagnik Ray ; Tuarob, Suppawong ; Caragea, Cornelia ; Giles, C. Lee. / PDFMEF : A multi-entity knowledge extraction framework for scholarly documents and semantic search. Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015. Association for Computing Machinery, Inc, 2015. (Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015).
@inproceedings{03c595b3d03341c6be35a4be1d0add74,
title = "PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search",
abstract = "We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. Currently, it leverages PDFBox and TET for full text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and algorithm extraction [27]. While it can be run as a whole, the extraction tool in each module is highly customizable. Users can substitute default extractors with other extraction tools they prefer by writing a thin wrapper to implement the abstracts. The framework is designed to be scalable and is capable of running in parallel using a multi-processing technique in Python. Experiments indicate that the system with default setups is CPU bounded, and leaves a small footprint in the memory, which makes it best to run on a multi-core machine. The best performance using a dedicated server of 16 cores takes 1.3 seconds on average to process one PDF document. It is used to index extracted information and help users to quickly locate relevant results in published scholarly documents and to efficiently construct a large knowledge base in order to build a semantic scholarly search engine. Part of it is running on CiteSeerX digital library search engine.",
author = "Jian Wu and Jason Killian and Huaiyu Yang and Kyle Williams and Choudhury, {Sagnik Ray} and Suppawong Tuarob and Cornelia Caragea and Giles, {C. Lee}",
year = "2015",
month = "10",
day = "7",
doi = "10.1145/2815833.2815834",
language = "English (US)",
series = "Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015",
publisher = "Association for Computing Machinery, Inc",
booktitle = "Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015",

}

Wu, J, Killian, J, Yang, H, Williams, K, Choudhury, SR, Tuarob, S, Caragea, C & Giles, CL 2015, PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. in Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015., 13, Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015, Association for Computing Machinery, Inc, 8th International Conference on Knowledge Capture, K-CAP 2015, Palisades, United States, 10/7/15. https://doi.org/10.1145/2815833.2815834

TY - GEN

T1 - PDFMEF

T2 - A multi-entity knowledge extraction framework for scholarly documents and semantic search

AU - Wu, Jian

AU - Killian, Jason

AU - Yang, Huaiyu

AU - Williams, Kyle

AU - Choudhury, Sagnik Ray

AU - Tuarob, Suppawong

AU - Caragea, Cornelia

AU - Giles, C. Lee

PY - 2015/10/7

Y1 - 2015/10/7

AB - We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. Currently, it leverages PDFBox and TET for full text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and algorithm extraction [27]. While it can be run as a whole, the extraction tool in each module is highly customizable. Users can substitute default extractors with other extraction tools they prefer by writing a thin wrapper to implement the abstracts. The framework is designed to be scalable and is capable of running in parallel using a multi-processing technique in Python. Experiments indicate that the system with default setups is CPU bounded, and leaves a small footprint in the memory, which makes it best to run on a multi-core machine. The best performance using a dedicated server of 16 cores takes 1.3 seconds on average to process one PDF document. It is used to index extracted information and help users to quickly locate relevant results in published scholarly documents and to efficiently construct a large knowledge base in order to build a semantic scholarly search engine. Part of it is running on CiteSeerX digital library search engine.

UR - http://www.scopus.com/inward/record.url?scp=84997208457&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84997208457&partnerID=8YFLogxK

U2 - 10.1145/2815833.2815834

DO - 10.1145/2815833.2815834

M3 - Conference contribution

AN - SCOPUS:84997208457

T3 - Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015

BT - Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015

PB - Association for Computing Machinery, Inc

ER -

Wu J, Killian J, Yang H, Williams K, Choudhury SR, Tuarob S et al. PDFMEF: A multi-entity knowledge extraction framework for scholarly documents and semantic search. In Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015. Association for Computing Machinery, Inc. 2015. 13. (Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015). https://doi.org/10.1145/2815833.2815834