Re: Improving N-Distinct estimation by ANALYZE

From: Greg Stark <gsstark(at)mit(dot)edu>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date: 2006-01-06 06:24:41
Message-ID: 871wzl50wm.fsf@stark.xeocode.com
Lists: pgsql-hackers


Josh Berkus <josh(at)agliodbs(dot)com> writes:

> Greg,
>
> > We *currently* use a block based sampling algorithm that addresses this
> > issue by taking care to select rows within the selected blocks in an
> > unbiased way. You were proposing reading *all* the records from the
> > selected blocks, which throws away that feature.
>
> The block-based algorithms have specific math to cope with this, math that
> is better grounded in statistical analysis than our code. Please read the
> papers before you judge the solution.

I'm confused, since Postgres's current setup *was* based on papers on just
this topic. I haven't read the papers myself; I'm just repeating what was
discussed when the current block sampling code went in. There was extensive
discussion of the pros and cons of different algorithms taken from various
papers and of how they related to Postgres, and I don't recall any of them
suggesting a way to take a sample that included every row from a sampled
block and then somehow unbias the sample after the fact.
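
To make concrete what I mean by selecting rows within blocks in an unbiased
way, here is a rough Python sketch of that kind of two-stage scheme. This is
my paraphrase rather than a transcription of analyze.c, and rows_in_block,
targrows and nsample_blocks are names I've picked for illustration:

    import random

    def two_stage_sample(nblocks, rows_in_block, targrows, nsample_blocks):
        # Stage one: pick which blocks to read (the real code visits them in
        # physical order; a plain random choice is enough for the sketch).
        blocks = random.sample(range(nblocks), min(nsample_blocks, nblocks))

        # Stage two: reservoir-sample rows from whatever rows those blocks
        # actually contain, so every row the scan sees has the same chance
        # of ending up in the final sample.
        sample = []
        seen = 0
        for b in sorted(blocks):
            for row in rows_in_block(b):
                seen += 1
                if len(sample) < targrows:
                    sample.append(row)
                else:
                    # Keep the new row with probability targrows/seen by
                    # overwriting a random slot; plain reservoir sampling
                    # (the real code uses Vitter's faster variant).
                    j = random.randrange(seen)
                    if j < targrows:
                        sample[j] = row
        return sample

The sample is capped at targrows rows however many rows the chosen blocks
happen to hold, and every row the scan sees has the same chance of surviving
the reservoir. Keeping every row of every chosen block gives that up: the
sample's composition then simply tracks however values are clustered on disk.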

> Right, which is why researchers are currently looking for something better.
> The Brutlag & Richardson paper claims to produce estimates within +/- 3x of
> the true value 90% of the time using a 5% sample, which is far better than
> our current accuracy. Nobody claims to be able to estimate based on <
> 0.1% of the table, which is what Postgres tries to do for large tables.
>
> 5% based on block-based sampling is reasonable; that means a straight 5% of
> the on-disk size of the table, so 5GB for a 100GB table. With random-row
> sampling, that would require as much as 25% of the table, making it easier
> to just scan the whole thing.

Postgres's current sample sizes are clearly geared towards the histograms,
where they are entirely realistic. The distinct estimates, on the other hand,
are just ad hoc attempts based on the existing sampling.
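
(For concreteness, and going from memory rather than from the source: I
believe the estimator currently applied to the sample is the "Duj1" estimator
from Haas & Stokes, something along these lines:

    def duj1_ndistinct(samplerows, totalrows, d, f1):
        # d  = number of distinct values observed in the sample
        # f1 = number of values observed exactly once in the sample
        # My recollection of the Haas & Stokes "Duj1" estimator; a sketch
        # of what analyze.c does, not a quotation of it.
        numer = samplerows * d
        denom = samplerows - f1 + f1 * samplerows / totalrows
        return numer / denom

Note how much of the answer rides on f1, the count of values seen exactly
once, which is just the sort of quantity a tiny sample measures worst.)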

Is a mechanism that is only 5x faster than reading the whole table (assuming
random_page_cost of 4) and is off by more than a factor of three 10% of the
time really worth it?
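
To spell out where that 5x comes from (my arithmetic, costing the 5% of
blocks as random fetches at the default random_page_cost):

    # Costs in units of sequential block reads; assumptions, not measurements.
    random_page_cost = 4.0
    seq_page_cost = 1.0
    sample_fraction = 0.05

    full_scan = 1.00 * seq_page_cost                    # whole table, sequential
    block_sample = sample_fraction * random_page_cost   # 5% of blocks, random I/O

    print(full_scan / block_sample)                     # 1.0 / 0.2 = 5.0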

--
greg
