Turbo-charging hidden database samplers with overflowing queries and skew reduction

Arjun Dasgupta, Nan Zhang, Gautam Das

Research output: Chapter in Book/Report/Conference proceedingConference contribution

17 Scopus citations

Abstract

Recently, there has been growing interest in random sampling from online hidden databases. These databases reside behind form-like web interfaces which allow users to execute search queries by specifying the desired values for certain attributes, and the system responds by returning a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable scoring function. In this paper, we consider the problem of uniform random sampling over such hidden databases. A key challenge is to eliminate the skew of samples incurred by the selective return of highly ranked tuples. To address this challenge, all state-of-the-art samplers share a common approach: they do not use overflowing queries. This is done in order to avoid favoring highly ranked tuples and thus incurring high skew in the retrieved samples. However, not considering overflowing queries substantially impacts sampling efficiency. In this paper, we propose novel sampling techniques which do leverage overflowing queries. As a result, we are able to significantly improve sampling efficiency over the state-of-the-art samplers, while at the same time substantially reduce the skew of generated samples. We conduct extensive experiments over synthetic and real-world databases to illustrate the superiority of our techniques over the existing ones.

Original languageEnglish (US)
Title of host publicationAdvances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
Pages51-62
Number of pages12
DOIs
StatePublished - 2010
Event13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010 - Lausanne, Switzerland
Duration: Mar 22 2010Mar 26 2010

Publication series

NameAdvances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings

Other

Other13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010
Country/TerritorySwitzerland
CityLausanne
Period3/22/103/26/10

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Fingerprint

Dive into the research topics of 'Turbo-charging hidden database samplers with overflowing queries and skew reduction'. Together they form a unique fingerprint.

Cite this