Searching dimension incomplete databases

Wei Cheng, Xiaoming Jin, Jian Tao Sun, Xuemin Lin, Xiang Zhang, Wei Wang

Research output: Contribution to journal › Article

13 Citations (Scopus)

Abstract

Similarity query is a fundamental problem in database, data mining, and information retrieval research. Recently, querying incomplete data has attracted extensive attention because it poses new challenges to traditional querying techniques. Existing work on querying incomplete data addresses the case where the data values on certain dimensions are unknown. However, in many real-life applications, such as data collected by a sensor network in a noisy environment, not only the data values but also the dimension information may be missing. In this work, we investigate the problem of similarity search on dimension-incomplete data. We develop a probabilistic framework to model this problem so that users can find objects in the database that are similar to the query with a probability guarantee. Missing dimension information poses a great computational challenge, since all possible combinations of missing dimensions need to be examined when evaluating the similarity between the query and the data objects. We derive lower and upper bounds on the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant data objects without explicitly examining all missing-dimension combinations. A probability triangle inequality is also employed to further prune the search space and speed up query processing. The proposed probabilistic framework and techniques can be applied to both whole-sequence and subsequence queries. Extensive experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
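To make the combinatorial cost described in the abstract concrete, the following is a minimal brute-force sketch, not the authors' method. It rests on assumptions that are not stated in the abstract: the observed values keep their original order, every placement of the observed values onto the query's dimensions is equally likely, and the distance is the Euclidean distance over the observed dimensions only (the unknown values on the missing dimensions are ignored). The function name `match_probability` and its parameters are hypothetical.

```python
from itertools import combinations
from math import comb, sqrt


def match_probability(query, observed, radius):
    """Fraction of dimension placements under which `observed` lies
    within `radius` of `query`.

    Assumptions (illustrative only, not taken from the paper):
    - `observed` keeps its original order but its dimension indices are lost;
    - every order-preserving placement is equally likely;
    - only the observed dimensions contribute to the distance.
    """
    n, m = len(query), len(observed)
    if m > n:
        raise ValueError("observed object has more dimensions than the query")

    hits = 0
    # Enumerate every way to choose which m of the n dimensions survived.
    for positions in combinations(range(n), m):
        dist = sqrt(sum((query[p] - v) ** 2 for p, v in zip(positions, observed)))
        if dist <= radius:
            hits += 1

    # Uniform prior over the C(n, m) placements.
    return hits / comb(n, m)


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    x = [1.0, 3.0]  # two of the four dimensions were lost
    print(match_probability(q, x, 0.5))
```

The loop visits C(n, m) placements, which grows quickly with the number of missing dimensions. This is the cost that the probability bounds in the abstract are meant to avoid by filtering objects without enumerating every placement.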

Original language: English (US)
Article number: 6412668
Pages (from-to): 725-738
Number of pages: 14
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 26
Issue number: 3
DOI: 10.1109/TKDE.2013.14
State: Published - Mar 1 2014

Fingerprint

Information retrieval
Sensor networks
Data mining

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Cheng, Wei ; Jin, Xiaoming ; Sun, Jian Tao ; Lin, Xuemin ; Zhang, Xiang ; Wang, Wei. / Searching dimension incomplete databases. In: IEEE Transactions on Knowledge and Data Engineering. 2014 ; Vol. 26, No. 3. pp. 725-738.
@article{021c4b85388c4b4785a2e9ed466bb35e,
title = "Searching dimension incomplete databases",
author = "Wei Cheng and Xiaoming Jin and Sun, {Jian Tao} and Xuemin Lin and Xiang Zhang and Wei Wang",
year = "2014",
month = "3",
day = "1",
doi = "10.1109/TKDE.2013.14",
language = "English (US)",
volume = "26",
pages = "725--738",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "3",

}

Cheng, W, Jin, X, Sun, JT, Lin, X, Zhang, X & Wang, W 2014, 'Searching dimension incomplete databases', IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 3, 6412668, pp. 725-738. https://doi.org/10.1109/TKDE.2013.14

Searching dimension incomplete databases. / Cheng, Wei; Jin, Xiaoming; Sun, Jian Tao; Lin, Xuemin; Zhang, Xiang; Wang, Wei.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 3, 6412668, 01.03.2014, p. 725-738.

TY - JOUR

T1 - Searching dimension incomplete databases

AU - Cheng, Wei

AU - Jin, Xiaoming

AU - Sun, Jian Tao

AU - Lin, Xuemin

AU - Zhang, Xiang

AU - Wang, Wei

PY - 2014/3/1

Y1 - 2014/3/1

UR - http://www.scopus.com/inward/record.url?scp=84894437013&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894437013&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2013.14

DO - 10.1109/TKDE.2013.14

M3 - Article

AN - SCOPUS:84894437013

VL - 26

SP - 725

EP - 738

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 3

M1 - 6412668

ER -