Strong lower bounds for approximating distribution support size and the distinct elements problem

Sofya Raskhodnikova, Dana Ron, Amir Shpilka, Adam Davison Smith

Research output: Contribution to journal (Article)

52 Citations (Scopus)

Abstract

We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least 1/n. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length n. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha > 1$ requires $\Theta(n/\alpha^2)$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in n lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in n) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $X_1$ and $X_2$, with very different expectations and the following condition on the first k moments: $E[X_1]/E[X_2] = E[X_1^2]/E[X_2^2] = \cdots = E[X_1^k]/E[X_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger problem, but differs in the requirement that our random variables have to be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.
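The moment condition in the abstract can be checked numerically on a small case. The sketch below is only an illustration, not the construction from the paper: the helper names `moment` and `moment_ratios` and the toy pair `X1`, `X2` are assumptions made for this example. The pair has proportional first and second moments (k = 2), while the third moment breaks the pattern.

```python
from fractions import Fraction


def moment(pmf, j):
    """Return E[X^j] for a finite distribution given as {value: probability}."""
    return sum(p * x**j for x, p in pmf.items())


def moment_ratios(pmf1, pmf2, k):
    """Return the exact ratios E[X1^j] / E[X2^j] for j = 1..k."""
    return [moment(pmf1, j) / moment(pmf2, j) for j in range(1, k + 1)]


# Toy pair chosen by hand (hypothetical, for illustration only):
# X1 is 1 with probability 3/4 and 3 with probability 1/4; X2 is always 2.
X1 = {1: Fraction(3, 4), 3: Fraction(1, 4)}
X2 = {2: Fraction(1)}

ratios = moment_ratios(X1, X2, 3)
print(ratios)  # the first two ratios agree; the third differs
```

Here E[X1] = 3/2 and E[X2] = 2, so the expectations differ while the first two moment ratios coincide. The paper's construction requires this for the first k moments with far more widely separated expectations.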

Original language: English (US)
Pages (from-to): 813-842
Number of pages: 30
Journal: SIAM Journal on Computing
Volume: 39
Issue number: 3
DOIs: 10.1137/070701649
State: Published - Aug 27 2009

Fingerprint

Random variables
Lower bound
Distinct
Entropy
Random variable
Query Complexity
Integer
Database Systems
Approximation
Annual
Multiplicative
Query
Moment
Computing
Requirements

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
  • Mathematics(all)

Cite this

Raskhodnikova, Sofya ; Ron, Dana ; Shpilka, Amir ; Smith, Adam Davison. / Strong lower bounds for approximating distribution support size and the distinct elements problem. In: SIAM Journal on Computing. 2009 ; Vol. 39, No. 3. pp. 813-842.
@article{d4d1cc988af44c8e803bc9bb38368ca8,
title = "Strong lower bounds for approximating distribution support size and the distinct elements problem",
abstract = "We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least 1/n. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length n. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha > 1$ requires $\Theta(n/\alpha^2)$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in n lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in n) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $X_1$ and $X_2$, with very different expectations and the following condition on the first k moments: $E[X_1]/E[X_2] = E[X_1^2]/E[X_2^2] = \cdots = E[X_1^k]/E[X_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger problem, but differs in the requirement that our random variables have to be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.",
author = "Raskhodnikova, Sofya and Ron, Dana and Shpilka, Amir and Smith, {Adam Davison}",
year = "2009",
month = "8",
day = "27",
doi = "10.1137/070701649",
language = "English (US)",
volume = "39",
pages = "813--842",
journal = "SIAM Journal on Computing",
issn = "0097-5397",
publisher = "Society for Industrial and Applied Mathematics Publications",
number = "3",
}

Strong lower bounds for approximating distribution support size and the distinct elements problem. / Raskhodnikova, Sofya; Ron, Dana; Shpilka, Amir; Smith, Adam Davison.

In: SIAM Journal on Computing, Vol. 39, No. 3, 27.08.2009, p. 813-842.

TY - JOUR

T1 - Strong lower bounds for approximating distribution support size and the distinct elements problem

AU - Raskhodnikova, Sofya

AU - Ron, Dana

AU - Shpilka, Amir

AU - Smith, Adam Davison

PY - 2009/8/27

Y1 - 2009/8/27

N2 - We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least 1/n. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length n. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha > 1$ requires $\Theta(n/\alpha^2)$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in n lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in n) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $X_1$ and $X_2$, with very different expectations and the following condition on the first k moments: $E[X_1]/E[X_2] = E[X_1^2]/E[X_2^2] = \cdots = E[X_1^k]/E[X_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger problem, but differs in the requirement that our random variables have to be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.

AB - We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least 1/n. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length n. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha > 1$ requires $\Theta(n/\alpha^2)$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in n lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in n) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $X_1$ and $X_2$, with very different expectations and the following condition on the first k moments: $E[X_1]/E[X_2] = E[X_1^2]/E[X_2^2] = \cdots = E[X_1^k]/E[X_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger problem, but differs in the requirement that our random variables have to be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.

UR - http://www.scopus.com/inward/record.url?scp=69049092204&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=69049092204&partnerID=8YFLogxK

U2 - 10.1137/070701649

DO - 10.1137/070701649

M3 - Article

AN - SCOPUS:69049092204

VL - 39

SP - 813

EP - 842

JO - SIAM Journal on Computing

JF - SIAM Journal on Computing

SN - 0097-5397

IS - 3

ER -