Table of contents recognition and extraction for heterogeneous book documents

Research output: Contribution to journal › Conference article

12 Citations (Scopus)

Abstract

Existing work on book table of contents (TOC) recognition has almost all been done on small, application-dependent, and domain-specific datasets. However, the TOCs of books from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large-scale collection of heterogeneous books. We observed that TOCs can be placed into three basic styles, namely "flat", "ordered", and "divided", giving insights into how to achieve effective TOC parsing. As such, we propose a new TOC recognition approach that adaptively decides the most appropriate TOC parsing rules based on the classification of these three TOC styles. Evaluation on a large number (over 25,000) of book documents from various domains demonstrates its effectiveness and efficiency.
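The adaptive idea in the abstract (classify the TOC style first, then apply style-specific parsing rules) can be illustrated with a minimal sketch. The abstract names the three styles but does not define them, so everything below is an assumption for illustration only: the regular expression, the 50% threshold, the blank-line test for "divided", and the function names `classify_toc_style`, `parse_flat`, `parse_ordered`, `parse_divided`, and `parse_toc` are hypothetical and are not the authors' rules.

```python
import re
from typing import Callable, Dict, List, Tuple

Entry = Dict[str, object]

# Hypothetical pattern for an explicit ordering label such as
# "Chapter 3", "1", "1.2", "2." or "IV." at the start of a TOC line.
ORDERED_PREFIX = re.compile(
    r"^\s*(chapter\s+\d+|\d+(?:\.\d+)*\.?|[ivxlc]+\.)\s+", re.IGNORECASE
)


def classify_toc_style(lines: List[str]) -> str:
    """Guess one of the three styles from raw TOC lines (illustrative rules)."""
    entries = [ln for ln in lines if ln.strip()]
    if not entries:
        return "flat"
    numbered = sum(1 for ln in entries if ORDERED_PREFIX.match(ln))
    # Assumption: at least half of the entries carry a number -> "ordered".
    if numbered / len(entries) >= 0.5:
        return "ordered"
    # Assumption: interior blank lines split the TOC into blocks -> "divided".
    if any(not ln.strip() for ln in lines[1:-1]):
        return "divided"
    return "flat"


def parse_flat(lines: List[str]) -> List[Entry]:
    """Every non-empty line becomes a top-level entry."""
    return [{"title": ln.strip(), "level": 1} for ln in lines if ln.strip()]


def parse_ordered(lines: List[str]) -> List[Entry]:
    """Derive the nesting level from dotted numbers ("1.2" -> level 2)."""
    entries: List[Entry] = []
    for ln in lines:
        if not ln.strip():
            continue
        match = ORDERED_PREFIX.match(ln)
        if match is None:
            entries.append({"title": ln.strip(), "level": 1})
            continue
        label = match.group(1)
        level = label.rstrip(".").count(".") + 1 if label[0].isdigit() else 1
        entries.append({"title": ln[match.end():].strip(), "level": level})
    return entries


def parse_divided(lines: List[str]) -> List[Entry]:
    """First line of each blank-line-separated block is level 1, the rest level 2."""
    entries: List[Entry] = []
    block_start = True
    for ln in lines:
        if not ln.strip():
            block_start = True
            continue
        entries.append({"title": ln.strip(), "level": 1 if block_start else 2})
        block_start = False
    return entries


PARSERS: Dict[str, Callable[[List[str]], List[Entry]]] = {
    "flat": parse_flat,
    "ordered": parse_ordered,
    "divided": parse_divided,
}


def parse_toc(lines: List[str]) -> Tuple[str, List[Entry]]:
    """Classify the style, then dispatch to the matching rule set."""
    style = classify_toc_style(lines)
    return style, PARSERS[style](lines)
```

Calling `parse_toc(["1 Introduction", "1.1 Background", "2 Methods"])` would take the "ordered" branch under these assumed rules. The point of the sketch is only the dispatch structure: the classifier's decision selects which parser runs, so each rule set can stay simple.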

Original language: English (US)
Article number: 6628805
Pages (from-to): 1205-1209
Number of pages: 5
Journal: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
DOIs: 10.1109/ICDAR.2013.244
State: Published - Dec 11 2013
Event: 12th International Conference on Document Analysis and Recognition, ICDAR 2013 - Washington, DC, United States
Duration: Aug 25 2013 to Aug 28 2013

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition

Cite this

@article{3265f4401dd14722965c127cb735209f,
title = "Table of contents recognition and extraction for heterogeneous book documents",
abstract = "Existing work on book table of contents (TOC) recognition has been almost all on small size, application-dependent, and domain-specific datasets. However, TOC of books from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large scale collection of heterogeneous books. We observed that TOCs can be placed into three basic styles, namely ``flat'', ``ordered'', and ``divided'', giving insights into how to achieve effective TOC parsing. As such, we propose a new TOC recognition approach which adaptively decides the most appropriate TOC parsing rules based on the classification of these three TOC styles. Evaluation on large number, over 25,000, of book documents from various domains demonstrates its effectiveness and efficiency.",
author = "Zhaohui Wu and Prasenjit Mitra and Giles, {C. Lee}",
year = "2013",
month = "12",
day = "11",
doi = "10.1109/ICDAR.2013.244",
language = "English (US)",
pages = "1205--1209",
journal = "Proceedings of the International Conference on Document Analysis and Recognition, ICDAR",
issn = "1520-5363",

}

TY - JOUR

T1 - Table of contents recognition and extraction for heterogeneous book documents

AU - Wu, Zhaohui

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2013/12/11

Y1 - 2013/12/11

N2 - Existing work on book table of contents (TOC) recognition has been almost all on small size, application-dependent, and domain-specific datasets. However, TOC of books from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large scale collection of heterogeneous books. We observed that TOCs can be placed into three basic styles, namely ''flat'', ''ordered'', and ''divided'', giving insights into how to achieve effective TOC parsing. As such, we propose a new TOC recognition approach which adaptively decides the most appropriate TOC parsing rules based on the classification of these three TOC styles. Evaluation on large number, over 25,000, of book documents from various domains demonstrates its effectiveness and efficiency.

AB - Existing work on book table of contents (TOC) recognition has been almost all on small size, application-dependent, and domain-specific datasets. However, TOC of books from different domains differ significantly in their visual layout and style, making TOC recognition a challenging problem for a large scale collection of heterogeneous books. We observed that TOCs can be placed into three basic styles, namely ''flat'', ''ordered'', and ''divided'', giving insights into how to achieve effective TOC parsing. As such, we propose a new TOC recognition approach which adaptively decides the most appropriate TOC parsing rules based on the classification of these three TOC styles. Evaluation on large number, over 25,000, of book documents from various domains demonstrates its effectiveness and efficiency.

UR - http://www.scopus.com/inward/record.url?scp=84889577640&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889577640&partnerID=8YFLogxK

U2 - 10.1109/ICDAR.2013.244

DO - 10.1109/ICDAR.2013.244

M3 - Conference article

AN - SCOPUS:84889577640

SP - 1205

EP - 1209

JO - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

JF - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR

SN - 1520-5363

M1 - 6628805

ER -