Schema matching and embedded value mapping for databases with opaque column names and mixed continuous and discrete-valued data fields

Research output: Contribution to journalArticlepeer-review

6 Scopus citations

Abstract

Schema matching and value mapping across two information sources, such as databases, are critical information aggregation tasks. Before data can be integrated from multiple tables, the columns and values within the tables must be matched. The complexities of both these problems grow quickly with the number of attributes to be matched and due to multiple semantics of data values. Traditional research has mostly tackled schema matching and value mapping independently, and for categorical (discrete-valued) attributes. We propose novel methods that leverage value mappings to enhance schema matching in the presence of opaque column names for schemas consisting of both continuous and discrete-valued attributes. An additional source of complexity is that a discrete-valued attribute in one schema could in fact be a quantized, encoded version of a continuous-valued attribute in the other schema. In our approach, which can tackle both "onto" and bijective schema matching, the fitness objective for matching a pair of attributes from two schemas exploits the statistical distribution over values within the two attributes. Suitable fitness objectives are based on Euclidean-distance and the data log-likelihood, both of which are applied in our experimental study. A heuristic local descent optimization strategy that uses two-opt switching to optimize attribute matches, while simultaneously embedding value mappings, is applied for our matching methods. Our experiments show that the proposed techniques matched mixed continuous and discrete-valued attribute schemas with high accuracy and, thus, should be a useful addition to a framework of (semi) automated tools for data alignment.

Original languageEnglish (US)
Article number2
JournalACM Transactions on Database Systems
Volume38
Issue number1
DOIs
StatePublished - Apr 1 2013

All Science Journal Classification (ASJC) codes

  • Information Systems

Fingerprint Dive into the research topics of 'Schema matching and embedded value mapping for databases with opaque column names and mixed continuous and discrete-valued data fields'. Together they form a unique fingerprint.

Cite this