A learning-based text synthesis engine for scene text detection

Xiao Yang, Dafang He, Daniel Kifer, C. Lee Giles

Research output: Contribution to conference › Paper › peer-review


Scene text detection (STD) and recognition (STR) methods have recently improved greatly, with synthetic training data playing an important role. Nevertheless, for the text detection task, a model trained solely on large-scale synthetic data performs significantly worse than one trained on even a few real-world samples, whereas state-of-the-art text recognition can be achieved by training on synthetic data alone [10]. This shows the limitations of relying only on large-scale synthetic data for scene text detection. In this work, we propose the first learning-based, data-driven text synthesis engine for the scene text detection task. Our text synthesis engine is decomposed into two modules: 1) a location module that learns the distribution of text locations on the image plane, and 2) an appearance module that translates text-inserted images into realistic-looking ones that are essentially indistinguishable from real-world scene text images. Synthetic data created by our engine outperforms previous text synthesis methods when evaluated on the ICDAR 2015 Incidental Scene Text dataset [15].
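The two-module decomposition above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the function names, the hand-picked sampling distribution, and the naive blending are hypothetical stand-ins, since the paper's actual modules are learned neural networks (a learned location distribution and an image-to-image translation network).

```python
import numpy as np

# Hypothetical sketch of the two-module pipeline from the abstract.
# The real location module LEARNS where text appears; this stand-in
# simply samples boxes from a fixed, center-biased distribution.
rng = np.random.default_rng(0)

def location_module(image_shape, n_boxes=3):
    """Toy location module: sample text box centers and sizes."""
    h, w = image_shape[:2]
    centers = rng.normal(loc=(h / 2, w / 2), scale=(h / 6, w / 6),
                         size=(n_boxes, 2))
    sizes = rng.uniform((10, 40), (30, 120), size=(n_boxes, 2))
    boxes = np.concatenate([centers - sizes / 2, centers + sizes / 2], axis=1)
    # Clip to image bounds; rows are (y0, x0, y1, x1).
    return np.clip(boxes, 0, [h, w, h, w]).astype(int)

def appearance_module(image, boxes):
    """Toy appearance module: blend flat gray "text" patches into the image.
    The paper's module is a translation network that makes the composite
    indistinguishable from a real scene-text photo."""
    out = image.astype(float).copy()
    for y0, x0, y1, x1 in boxes:
        out[y0:y1, x0:x1] = 0.5 * out[y0:y1, x0:x1] + 0.5 * 200.0
    return out.astype(np.uint8)

# Run the pipeline on a random "scene" image.
image = rng.integers(0, 255, size=(240, 320, 3), dtype=np.uint8)
boxes = location_module(image.shape)
synthetic = appearance_module(image, boxes)
```

The point of the sketch is the interface, not the content: the location module decides *where* text goes, and the appearance module decides *how* the composite looks, so each can be trained (or replaced) independently.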

Original language: English (US)
State: Published - 2020
Event: 30th British Machine Vision Conference, BMVC 2019 - Cardiff, United Kingdom
Duration: Sep 9, 2019 to Sep 12, 2019


Conference: 30th British Machine Vision Conference, BMVC 2019
Country/Territory: United Kingdom

All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition


