Re: ANALYZE sampling is too good

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 23:27:31
Message-ID: 52A3AEE3.2030706@catalyst.net.nz
Lists: pgsql-hackers

On 08/12/13 09:25, Josh Berkus wrote:
> On 12/07/2013 11:46 AM, Robert Haas wrote:
>> Maybe there's some highly-principled statistical approach which could
>> be taken here, and if so that's fine, but I suspect not. So what I
>> think we should do is auto-tune the statistics target based on the
>> table size. If, say, we think that the generally useful range for the
>> statistics target is something like 10 to 400, then let's come up with
>> a formula based on table size that outputs 10 for small tables, 400
>> for really big tables, and intermediate values for tables in the
>> middle.
>
> The only approach which makes sense is to base it on a % of the table.
> In fact, pretty much every paper which has examined statistics
> estimation for database tables has determined that any estimate based on
> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
> samples are 100% accurate, but at least they fit the 80/20 rule.
>
> This is the reason why implementing block-based sampling is critical;
> using our current "take one row out of every page" method, sampling 5%
> of the table means scanning the whole thing in most tables. We also
> need to decouple the number of MCVs we keep from the sample size.
> Certainly our existing sampling algo seems designed to maximize IO for
> the sample size.
>
> There are other qualitative improvements we could make, which Nathan Boley
> has spoken on. For example, our stats code has no way to recognize a
> normal or exponential distribution -- it assumes that all columns are
> randomly distributed. If we could recognize common distribution
> patterns, then not only could we have better query estimates, those
> would require keeping *fewer* stats, since all you need for a normal
> distribution are the end points and the variance.
>
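
Regarding Robert's auto-tuning idea - just to make it concrete, a size-based
target could be derived along these lines (the constants, the log
interpolation and the function name are purely illustrative on my part, not
a proposal):

#include <math.h>

/*
 * Illustrative only: interpolate the statistics target between 10 for
 * small tables and 400 for large ones, on a log scale of relpages.
 * The thresholds are invented for the example.
 */
static int
auto_stats_target(double relpages)
{
    const double small_rel = 100.0;      /* at or below this: minimum target */
    const double large_rel = 1000000.0;  /* at or above this: maximum target */
    double       frac;

    if (relpages <= small_rel)
        return 10;
    if (relpages >= large_rel)
        return 400;

    frac = (log(relpages) - log(small_rel)) /
           (log(large_rel) - log(small_rel));
    return (int) (10 + frac * (400 - 10));
}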

On the block-sampling point itself, from src/backend/commands/analyze.c:

* As of May 2004 we use a new two-stage method: Stage one selects up
* to targrows random blocks (or all blocks, if there aren't so many).
* Stage two scans these blocks and uses the Vitter algorithm to create
* a random sample of targrows rows (or less, if there are less in the
* sample of blocks). The two stages are executed simultaneously: each
* block is processed as soon as stage one returns its number and while
* the rows are read stage two controls which ones are to be inserted
* into the sample.

I don't think we always read every block (it's been a while since I looked
at this code, so I'll recheck). From what I understand, this stuff is based
on reasonable research (the Vitter algorithm). Not to say it couldn't or
shouldn't be looked at again and improved - but it was not just dreamed up
with no valid research!

Cheers

Mark
