Abstract
XML is fast emerging as the dominant standard for representing data in application-centric documents. While a great deal of recent work has proposed extracting relevant data from natural language texts, most of it must contend with the irregular structure hidden in the text. To this end, a large spectrum of wrappers has been conceived for web pages. Unfortunately, they cannot deal with semi-structured data and still do not take natural language processing into account. In this paper, we present a specification language for writing expressive and easy extraction patterns. The specification follows a regular-expression style so that non-expert users can write patterns. In addition, we introduce the Xtractor wrapper for coarser-grained data (i.e., paragraphs). Xtractor hinges on linguistic parsing of paragraphs and applies technical and natural language dictionaries. It then applies the extraction patterns to the pre-processed paragraphs in order to locate relevant data. The key idea of our approach consists of translating the extraction patterns into Finite State Transducers (FSTs) and even using the FSTs to build the domain-specific dictionaries.
Field | Value |
---|---|
Original language | English (US) |
Pages (from-to) | 117-123 |
Number of pages | 7 |
Journal | Journal of Digital Information Management |
Volume | 4 |
Issue number | 2 |
State | Published - Jun 1 2006 |
All Science Journal Classification (ASJC) codes
- Management Information Systems
- Information Systems
- Library and Information Sciences