Re: ANALYZE sampling is too good

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 20:25:59
Message-ID: 52A38457.6090204@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/07/2013 11:46 AM, Robert Haas wrote:
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not. So what I
> think we should do is auto-tune the statistics target based on the
> table size. If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.

The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate. Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables. We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.

There's other qualitative improvements we could make, which Nathan Boley
has spoken on. For example, our stats code has no way to recognize a
normal or exponential distrbution -- it assumes that all columns are
randomly distributed. If we could recoginze common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need for a normal
distribution are the end points and the variance.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-12-07 20:33:19 Re: WITHIN GROUP patch
Previous Message Steve Singer 2013-12-07 20:18:06 Re: plpgsql_check_function - rebase for 9.3