Re: Improving N-Distinct estimation by ANALYZE

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: tshipley(at)deru(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 15:12:29
Message-ID: 87d5j64ski.fsf@stark.xeocode.com
Lists: pgsql-hackers


Josh Berkus <josh(at)agliodbs(dot)com> writes:

> > These statements are at odds with my admittedly basic understanding of
> > statistics. Isn't the power of a sample more related to the absolute size of
> > the sample than the sample as fraction of the population? Why not just pick
> > a smallish sample size, say about 3000, and apply it to all the tables, even
> > the ones with just a single row (modify appropriately from block sampling).
>
> Nope, it's definitely proportional. As a simple example, a sample of 500 rows
> in a table of 1000 rows should yield stats estimates with 90%+ accuracy. But a
> sample of 500 rows in a 600,000,000 row table is so small as to be nearly
> useless; it's quite possible to get all the same value in a random sample of <
> 0.1% even on a column with a D/N of 0.001. If you look at the papers cited,
> almost all researchers more recent than Chaudhuri use a proportional sample
> size.

To be clear, Josh is talking specifically about estimating how many distinct
values a query will see, not the more usual estimate of how many records the
query will see.

For estimating how many records a query like

SELECT * FROM tab WHERE x BETWEEN ? AND ?

will return, a fixed-size sample is on fairly solid ground. A sample of 600
gives (iirc) +/- 2% 19 times out of 20. That's roughly the sample size most
major opinion polls use...
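
The arithmetic behind that kind of claim is the standard margin-of-error
formula for a proportion. Here's a little Python sketch of it (mine, not from
any of the cited papers; the exact figures depend on the confidence level and
the true proportion, so don't expect them to match my recalled numbers exactly):

import math

def margin_of_error(n, N=None, p=0.5, z=1.96):
    """Worst-case 95% margin of error for estimating a proportion from a
    simple random sample of n rows.  If the population size N is given,
    apply the finite-population correction; for N much larger than n the
    correction is negligible, which is why only the absolute sample size
    matters for this kind of estimate."""
    moe = z * math.sqrt(p * (1 - p) / n)
    if N is not None:
        moe *= math.sqrt((N - n) / (N - 1))
    return moe

print(margin_of_error(600))                 # ~0.04, whatever the table size
print(margin_of_error(600, N=1000))         # smaller: the sample is most of the table
print(margin_of_error(600, N=600_000_000))  # still ~0.04: the correction vanishes

The point is that the table size only enters through a correction factor that
goes to 1 as the table grows, so a fixed sample size buys you a fixed accuracy.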

However, this logic doesn't work for estimating distinct values. A single
occurrence of a value matters just as much as hundreds of occurrences, and
your chance of catching that single occurrence is proportional to the
percentage of the overall table you sample. So to maintain a given accuracy
you have to sample a fixed percentage of the table, not a fixed number of rows.
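
A quick simulation makes that concrete (again my own sketch; the table sizes,
the uniform value distribution, and the 3000-row sample are arbitrary choices
for illustration):

import random

def distinct_seen(D, n, trials=20):
    """Approximate drawing a sample of n rows (with replacement) from a
    table whose values are spread evenly over D distinct values, and
    return the average number of distinct values that appear in the
    sample."""
    total = 0
    for _ in range(trials):
        total += len({random.randrange(D) for _ in range(n)})
    return total / trials

# Fixed 3000-row sample; D/N held at 0.001 while the table grows.
for N in (1_000_000, 10_000_000, 100_000_000):
    D = N // 1000
    seen = distinct_seen(D, 3000)
    print(f"N={N:>11,}  D={D:>7,}  distinct values seen: {seen:7.0f} ({seen/D:.0%})")

With the same 3000-row sample, the share of the distinct values that ever shows
up in the sample falls from roughly 95% on the million-row table to a few
percent on the hundred-million-row table, so an estimator working from that
sample has almost nothing to go on. And that's with a perfectly uniform
distribution; skew only makes the rare values harder to catch.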

Worse, my recollection from the paper I mentioned earlier is that sampling
small percentages like 3-5% didn't get you acceptable accuracy; you had to
sample a very large percentage of the table before the estimates became
reliable. And note that once you're sampling anything over 10-20% of the table
you may as well just read the whole thing, since random reads are that much
slower than a sequential scan.
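
To put rough numbers on that last point (my own back-of-the-envelope; the 4:1
ratio below is just PostgreSQL's default planner ratio of random to sequential
page cost, and real disks can be considerably worse):

def io_cost(total_pages, fraction_sampled,
            seq_page_cost=1.0, random_page_cost=4.0):
    """Crude I/O cost comparison for block sampling: fetching the sampled
    pages with random reads versus one sequential pass over everything."""
    random_reads = fraction_sampled * total_pages * random_page_cost
    full_scan = total_pages * seq_page_cost
    return random_reads, full_scan

for frac in (0.05, 0.10, 0.25, 0.50):
    rnd, full = io_cost(1_000_000, frac)
    print(f"sample {frac:4.0%}: random reads cost {rnd:>12,.0f}  vs full scan {full:>12,.0f}")

With a 4:1 penalty the crossover is at 25% of the table; with the 5:1 to 10:1
penalties you see on real spinning disks it lands right in that 10-20% range.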

--
greg
