TY - GEN
T1 - Improving the table boundary detection in PDFs by fixing the sequence error of the sparse lines
AU - Liu, Ying
AU - Bai, Kun
AU - Mitra, Prasenjit
AU - Giles, C. Lee
PY - 2009
Y1 - 2009
N2 - With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification, and retrieval. Tables, a ubiquitous document component, are an important information source. Accurately detecting table boundaries plays a crucial role in many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text from PDFs directly is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of the two algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.
AB - With the rapid growth of PDF documents, recognizing document structure and components is useful for document storage, classification, and retrieval. Tables, a ubiquitous document component, are an important information source. Accurately detecting table boundaries plays a crucial role in many applications, e.g., meeting the increasing demand for table data search. Rather than converting PDFs to images or HTML and then processing them with other techniques (e.g., OCR), extracting and analyzing text from PDFs directly is easy and accurate. However, text extraction tools face a common problem: text sequence errors. In this paper, we propose two algorithms to recover the sequence of extracted sparse lines, which improves table content collection. The experimental results compare the performance of the two algorithms and demonstrate the effectiveness of text sequence recovery for table boundary detection.
UR - http://www.scopus.com/inward/record.url?scp=71249084337&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=71249084337&partnerID=8YFLogxK
U2 - 10.1109/ICDAR.2009.138
DO - 10.1109/ICDAR.2009.138
M3 - Conference contribution
AN - SCOPUS:71249084337
SN - 9780769537252
T3 - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
SP - 1006
EP - 1010
BT - ICDAR2009 - 10th International Conference on Document Analysis and Recognition
T2 - ICDAR2009 - 10th International Conference on Document Analysis and Recognition
Y2 - 26 July 2009 through 29 July 2009
ER -