Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Scopus citations

Abstract

With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification, and retrieval. The table, a ubiquitous document component, is an important information source. Accurately detecting the table boundary plays a crucial role in many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text directly from PDFs is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of the two algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.

Original language: English (US)
Title of host publication: ICDAR2009 - 10th International Conference on Document Analysis and Recognition
Pages: 1006-1010
Number of pages: 5
State: Published - 2009
Event: ICDAR2009 - 10th International Conference on Document Analysis and Recognition - Barcelona, Spain
Duration: Jul 26 2009 → Jul 29 2009

Publication series

Name: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
ISSN (Print): 1520-5363

Other

Other: ICDAR2009 - 10th International Conference on Document Analysis and Recognition
Country: Spain
City: Barcelona
Period: 7/26/09 → 7/29/09

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition

