Re: Improving N-Distinct estimation by ANALYZE

From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: tshipley(at)deru(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-05 15:12:29
Message-ID: 87d5j64ski.fsf@stark.xeocode.com
Lists: pgsql-hackers


Josh Berkus <josh(at)agliodbs(dot)com> writes:

> > These statements are at odds with my admittedly basic understanding of
> > statistics. Isn't the power of a sample more related to the absolute size of
> > the sample than the sample as fraction of the population? Why not just pick
> > a smallish sample size, say about 3000, and apply it to all the tables, even
> > the ones with just a single row (modify appropriately from block sampling).
>
> Nope, it's definitely proportional. As a simple example, a sample of 500 rows
> in a table of 1000 rows should yield stats estimates with 90%+ accuracy. But a
> sample of 500 rows in a 600,000,000 row table is so small as to be nearly
> useless; it's quite possible to get all the same value in a random sample of <
> 0.1% even on a column with a D/N of 0.001. If you look at the papers cited,
> almost all researchers more recent than Chaudhuri use a proportional sample
> size.

To be clear, Josh is talking specifically about estimating how many distinct
values a query will see, not the more usual estimate of how many records the
query will see.

For estimating how many records a query like

SELECT * FROM tab WHERE x BETWEEN ? AND ?

will return, a fixed-size sample is on fairly solid ground. A sample of 600
gives (iirc) +/- 2% 19 times out of 20. That's roughly the sample size most
major opinion polls use...
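
The arithmetic behind that kind of claim is the standard margin-of-error
formula for a proportion. Here's a little Python sketch of it (mine, not from
any of the cited papers; the exact figures depend on the confidence level and
the true proportion, so don't expect them to match my recalled numbers exactly):

import math

def margin_of_error(n, N=None, p=0.5, z=1.96):
    """Worst-case 95% margin of error for estimating a proportion from a
    simple random sample of n rows.  If the population size N is given,
    apply the finite-population correction; for N much larger than n the
    correction is negligible, which is why only the absolute sample size
    matters for this kind of estimate."""
    moe = z * math.sqrt(p * (1 - p) / n)
    if N is not None:
        moe *= math.sqrt((N - n) / (N - 1))
    return moe

print(margin_of_error(600))                 # ~0.04, whatever the table size
print(margin_of_error(600, N=1000))         # smaller: the sample is most of the table
print(margin_of_error(600, N=600_000_000))  # still ~0.04: the correction vanishes

The point is that the table size only enters through a correction factor that
goes to 1 as the table grows, so a fixed sample size buys you a fixed accuracy.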

However, this logic doesn't work for estimating distinct values. A single
occurrence of a value matters just as much as hundreds of occurrences, and
your chance of catching that single occurrence is proportional to the
percentage of the overall table you sample. So to maintain a given accuracy
you have to sample a fixed percentage of the table, not a fixed number of rows.
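
A quick simulation makes that concrete (again my own sketch; the table sizes,
the uniform value distribution, and the 3000-row sample are arbitrary choices
for illustration):

import random

def distinct_seen(D, n, trials=20):
    """Approximate drawing a sample of n rows (with replacement) from a
    table whose values are spread evenly over D distinct values, and
    return the average number of distinct values that appear in the
    sample."""
    total = 0
    for _ in range(trials):
        total += len({random.randrange(D) for _ in range(n)})
    return total / trials

# Fixed 3000-row sample; D/N held at 0.001 while the table grows.
for N in (1_000_000, 10_000_000, 100_000_000):
    D = N // 1000
    seen = distinct_seen(D, 3000)
    print(f"N={N:>11,}  D={D:>7,}  distinct values seen: {seen:7.0f} ({seen/D:.0%})")

With the same 3000-row sample, the share of the distinct values that ever shows
up in the sample falls from roughly 95% on the million-row table to a few
percent on the hundred-million-row table, so an estimator working from that
sample has almost nothing to go on. And that's with a perfectly uniform
distribution; skew only makes the rare values harder to catch.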

Worse, my recollection from the paper I mentioned earlier is that sampling
small percentages like 3-5% didn't get you acceptable accuracy; you had to
sample a very large percentage of the table before the estimates became
reliable. And note that once you're sampling anything over 10-20% of the table
you may as well just read the whole thing, since random reads are that much
slower than a sequential scan.
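
To put rough numbers on that last point (my own back-of-the-envelope; the 4:1
ratio below is just PostgreSQL's default planner ratio of random to sequential
page cost, and real disks can be considerably worse):

def io_cost(total_pages, fraction_sampled,
            seq_page_cost=1.0, random_page_cost=4.0):
    """Crude I/O cost comparison for block sampling: fetching the sampled
    pages with random reads versus one sequential pass over everything."""
    random_reads = fraction_sampled * total_pages * random_page_cost
    full_scan = total_pages * seq_page_cost
    return random_reads, full_scan

for frac in (0.05, 0.10, 0.25, 0.50):
    rnd, full = io_cost(1_000_000, frac)
    print(f"sample {frac:4.0%}: random reads cost {rnd:>12,.0f}  vs full scan {full:>12,.0f}")

With a 4:1 penalty the crossover is at 25% of the table; with the 5:1 to 10:1
penalties you see on real spinning disks it lands right in that 10-20% range.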

--
greg
