Extracting information from coarser-grained data in XML documents

Research output: Contribution to journal › Article › peer-review

Abstract

XML is fast emerging as the dominant standard for representing data in application-centric documents. While a great deal of recent work proposes extracting relevant data from natural language texts, most of it must contend with the irregular structure hidden in the text. To this end, a large spectrum of wrappers has been conceived for web pages. Unfortunately, they cannot deal with semi-structured data and still do not take natural language processing into account. In this paper, we present a specification language for writing expressive yet simple extraction patterns. The specification follows a regular-expression style so that non-expert users can write patterns. In addition, we introduce the Xtractor wrapper for coarser-grained data (i.e., paragraphs). Xtractor hinges on linguistic parsing of paragraphs and applies technical and natural language dictionaries. It then applies the extraction patterns to the pre-processed paragraphs in order to locate relevant data. The key idea of our approach is to translate the extraction patterns into Finite State Transducers (FSTs) and to use those FSTs to build the domain-specific dictionaries as well.
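The pipeline outlined in the abstract (dictionary tagging of a paragraph, then a pattern compiled to a finite state transducer) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the Xtractor implementation: the dictionary entries, the tag names (PARAM, UNIT, NUM, OTHER), the pattern `PARAM OTHER* NUM UNIT`, and the functions `tag_tokens` and `extract` are all hypothetical, and the transition table is written by hand rather than generated from the paper's specification language.

```python
import re

# Hypothetical domain dictionary: surface word -> tag (not from the paper).
DICTIONARY = {
    "pressure": "PARAM",
    "temperature": "PARAM",
    "bar": "UNIT",
    "celsius": "UNIT",
}


def tag_tokens(paragraph):
    """Tokenize a paragraph and tag each token using the dictionary."""
    tokens = re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+", paragraph)
    tagged = []
    for tok in tokens:
        if re.fullmatch(r"\d+(?:\.\d+)?", tok):
            tagged.append((tok, "NUM"))
        else:
            tagged.append((tok, DICTIONARY.get(tok.lower(), "OTHER")))
    return tagged


# Hand-written transducer for the assumed pattern: PARAM OTHER* NUM UNIT.
# (state, input tag) -> (next state, output label or None)
TRANSITIONS = {
    (0, "PARAM"): (1, "name"),
    (1, "OTHER"): (1, None),
    (1, "NUM"): (2, "value"),
    (2, "UNIT"): (3, "unit"),
}
FINAL_STATE = 3


def extract(tagged):
    """Run the transducer from each start position and collect its outputs."""
    records = []
    i = 0
    while i < len(tagged):
        state, record, j = 0, {}, i
        while j < len(tagged) and (state, tagged[j][1]) in TRANSITIONS:
            state, label = TRANSITIONS[(state, tagged[j][1])]
            if label is not None:
                record[label] = tagged[j][0]
            j += 1
            if state == FINAL_STATE:
                break
        if state == FINAL_STATE:
            records.append(record)
            i = j  # resume after the matched span
        else:
            i += 1  # no match starting here; try the next token
    return records


text = "The pressure was set to 2.5 bar and the temperature reached 80 Celsius."
print(extract(tag_tokens(text)))
```

Each transition consumes one input tag and optionally emits an output label, which is what distinguishes a transducer from a plain recognizer; a real system would presumably derive the table from the user's pattern instead of hard-coding it.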

Original language: English (US)
Pages (from-to): 117-123
Number of pages: 7
Journal: Journal of Digital Information Management
Volume: 4
Issue number: 2
State: Published - Jun 1 2006

All Science Journal Classification (ASJC) codes

  • Management Information Systems
  • Information Systems
  • Library and Information Sciences
