Re: ANALYZE sampling is too good

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-08 00:06:42
Message-ID: 52A3B812.4070407@catalyst.net.nz
Lists: pgsql-hackers

On 08/12/13 12:27, Mark Kirkwood wrote:
> On 08/12/13 09:25, Josh Berkus wrote:
>> On 12/07/2013 11:46 AM, Robert Haas wrote:
>>> Maybe there's some highly-principled statistical approach which could
>>> be taken here, and if so that's fine, but I suspect not. So what I
>>> think we should do is auto-tune the statistics target based on the
>>> table size. If, say, we think that the generally useful range for the
>>> statistics target is something like 10 to 400, then let's come up with
>>> a formula based on table size that outputs 10 for small tables, 400
>>> for really big tables, and intermediate values for tables in the
>>> middle.
>>
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
>>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
>>
>> There are other qualitative improvements we could make, which Nathan
>> Boley has spoken on. For example, our stats code has no way to recognize
>> a normal or exponential distribution -- it assumes that all columns are
>> randomly distributed. If we could recognize common distribution
>> patterns, then not only could we have better query estimates, those
>> would require keeping *fewer* stats, since all you need for a normal
>> distribution are the end points and the variance.
>>
>
> From src/backend/commands/analyze.c
>
> * As of May 2004 we use a new two-stage method: Stage one selects up
> * to targrows random blocks (or all blocks, if there aren't so many).
> * Stage two scans these blocks and uses the Vitter algorithm to create
> * a random sample of targrows rows (or less, if there are less in the
> * sample of blocks). The two stages are executed simultaneously: each
> * block is processed as soon as stage one returns its number and while
> * the rows are read stage two controls which ones are to be inserted
> * into the sample.
>
> I don't think we always read every block (been a while since I looked at
> this code, so I'll recheck). From what I understand this stuff is based
> on reasonable research (Vitter algorithm). Not to say it
> couldn't/shouldn't be looked at again to improve it - but it is not just
> dreamed up with no valid research!
>

Since I volunteered to recheck :-)

from analyze.c again:

/*--------------------
* The following choice of minrows is based on the paper
* "Random sampling for histogram construction: how much is enough?"
* by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
* Proceedings of ACM SIGMOD International Conference on Management
* of Data, 1998, Pages 436-447. Their Corollary 1 to Theorem 5
* says that for table size n, histogram size k, maximum relative
* error in bin size f, and error probability gamma, the minimum
* random sample size is
* r = 4 * k * ln(2*n/gamma) / f^2
* Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
* r = 305.82 * k
* Note that because of the log function, the dependence on n is
* quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
* bin size error with probability 0.99. So there's no real need to
* scale for n, which is a good thing because we don't necessarily
* know it at this point.
*--------------------
*/
stats->minrows = 300 * attr->attstattarget;
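
Just to double-check the arithmetic in that comment, here is a throwaway
standalone program (mine, nothing from the tree) that reproduces both
numbers:

/*
 * Sanity check of the Chaudhuri et al. bound quoted above:
 *     r = 4 * k * ln(2*n/gamma) / f^2
 * Build with: cc minrows.c -lm -o minrows
 */
#include <stdio.h>
#include <math.h>

int
main(void)
{
    double  f = 0.5;        /* maximum relative bin-size error */
    double  gam = 0.01;     /* error probability gamma */
    double  n = 1e6;        /* table rows */
    double  r_per_k;
    double  f_at_1e12;

    /* required sample rows per histogram bin, i.e. r / k */
    r_per_k = 4.0 * log(2.0 * n / gam) / (f * f);
    printf("n = 1e6 : r = %.2f * k\n", r_per_k);

    /* holding r = 300 * k fixed, what error bound do we get at n = 1e12? */
    n = 1e12;
    f_at_1e12 = sqrt(4.0 * log(2.0 * n / gam) / 300.0);
    printf("n = 1e12: f <= %.2f with probability 0.99\n", f_at_1e12);

    return 0;
}

It prints r = 305.82 * k and f <= 0.66, matching the comment.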

So for typical settings (default_statistics_target = 100), we try to
collect 30000 rows, which means the block sampler will select up to 30000
blocks (or every block, for tables smaller than that).
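
To make the quoted two-stage description a bit more concrete: stage two is
essentially reservoir sampling over the rows of the blocks chosen in stage
one. Here is a minimal standalone sketch of the idea (deliberately
simplified - the real acquire_sample_rows() uses Vitter's Algorithm Z to
compute skip distances instead of drawing a random number for every row,
and of course deals with heap tuples rather than ints):

#include <stdio.h>
#include <stdlib.h>

/*
 * Keep a uniform random sample of up to targrows items from a stream
 * whose total length is not known in advance (a stand-in for the rows
 * read from the stage-one blocks).
 */
static void
reservoir_sample(const int *stream, int nitems, int *sample, int targrows)
{
    int     nsampled = 0;   /* items currently in the reservoir */
    double  seen = 0;       /* items seen so far */
    int     i;

    for (i = 0; i < nitems; i++)
    {
        seen += 1;
        if (nsampled < targrows)
        {
            /* reservoir not yet full: always keep the item */
            sample[nsampled++] = stream[i];
        }
        else
        {
            /* replace a random slot with probability targrows / seen */
            double  k = seen * ((double) rand() / ((double) RAND_MAX + 1));

            if (k < targrows)
                sample[(int) k] = stream[i];
        }
    }
}

int
main(void)
{
    static int  stream[100000];
    int         sample[300];
    int         i;

    for (i = 0; i < 100000; i++)
        stream[i] = i;

    reservoir_sample(stream, 100000, sample, 300);
    printf("sample[0] = %d\n", sample[0]);
    return 0;
}

Stage one just decides which (up to targrows) blocks those rows will be
read from, and that is where the I/O goes.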

Indeed, quickly checking with a scale-100 pgbench database and a simple
patch that makes the block sampler report how many blocks it reads (note
that pgbench_accounts has 163935 blocks):

bench=# ANALYZE pgbench_branches;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 1 blocks
ANALYZE
Time: 16.935 ms

bench=# ANALYZE pgbench_accounts;
NOTICE: acquire sample will need 30000 blocks
NOTICE: sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q
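
(For completeness: the "patch" is nothing more than two elog(NOTICE, ...)
calls in acquire_sample_rows(), roughly along these lines - blocks_read is
a local counter I added, not an existing variable:)

    /* after the block sampler is set up */
    elog(NOTICE, "acquire sample will need %d blocks", targrows);

    /* ... increment blocks_read each time a block is actually read ... */

    /* after the block-reading loop */
    elog(NOTICE, "sampled %d blocks", blocks_read);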

Regards

Mark
