Learning to extract semantic structure from documents using multimodal fully convolutional neural networks

Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, Clyde Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

We present an end-to-end, multimodal, fully convolutional network for extracting semantic structures from document images. We consider document semantic structure extraction as a pixel-wise segmentation task, and propose a unified model that classifies pixels based not only on their visual appearance, as in the traditional page segmentation task, but also on the content of underlying text. Moreover, we propose an efficient synthetic document generation process that we use to generate pretraining data for our network. Once the network is trained on a large set of synthetic documents, we fine-tune the network on unlabeled real documents using a semi-supervised approach. We systematically study the optimum network architecture and show that both our multimodal approach and the synthetic data pretraining significantly boost the performance.
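
To make the abstract's approach concrete, below is a minimal sketch of a multimodal fully convolutional network of the kind described: a visual branch over the page image and a text branch over a per-pixel text-embedding map (each pixel carrying an embedding of the word printed there, zeros elsewhere), fused and decoded into per-pixel class logits. This is an illustrative assumption-laden sketch in PyTorch, not the authors' published architecture; the layer sizes, embedding dimension, class count, and the construction of the text map are all hypothetical.

    import torch
    import torch.nn as nn

    class MultimodalFCN(nn.Module):
        """Illustrative sketch only; not the model from the paper.

        The visual branch encodes the rendered page image; the text branch
        consumes a per-pixel text-embedding map. The two feature maps are
        concatenated and decoded into per-pixel class logits, treating
        semantic structure extraction as pixel-wise segmentation.
        """

        def __init__(self, text_embed_dim=64, num_classes=9):
            super().__init__()
            # Visual encoder: downsamples the page image by 4x.
            self.visual = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Text encoder: same spatial downsampling on the embedding map.
            self.textual = nn.Sequential(
                nn.Conv2d(text_embed_dim, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Decoder: fuse both modalities, upsample back to full resolution.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1),
            )

        def forward(self, image, text_map):
            # Channel-wise concatenation is one simple fusion choice.
            fused = torch.cat([self.visual(image), self.textual(text_map)], dim=1)
            return self.decoder(fused)  # (N, num_classes, H, W) per-pixel logits

    # Usage: one 256x256 page with a hypothetical 64-dim text embedding per pixel.
    model = MultimodalFCN()
    logits = model(torch.randn(1, 3, 256, 256), torch.randn(1, 64, 256, 256))
    print(logits.shape)  # torch.Size([1, 9, 256, 256])

Such a network is fully convolutional, so it can be pretrained on synthetic pages and fine-tuned on real documents of varying sizes, consistent with the pretraining-then-fine-tuning pipeline the abstract outlines.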

Original language: English (US)
Title of host publication: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4342-4351
Number of pages: 10
ISBN (Electronic): 9781538604571
DOIs: https://doi.org/10.1109/CVPR.2017.462
State: Published - Nov 6 2017
Event: 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 - Honolulu, United States
Duration: Jul 21 2017 - Jul 26 2017

Publication series

Name: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Volume: 2017-January

Other

Other: 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Country: United States
City: Honolulu
Period: 7/21/17 - 7/26/17

Fingerprint

  • Pixels
  • Semantics
  • Neural networks
  • Network architecture

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Computer Vision and Pattern Recognition

Cite this

Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., & Giles, C. L. (2017). Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 (pp. 4342-4351). (Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017; Vol. 2017-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CVPR.2017.462