Re: ANALYZE sampling is too good

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-07 19:46:17
Message-ID: CA+TgmoZaqyGSuaL2v+YFVsX06DQDQh-pEV0nobGPws-dNwAwBw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again in Heroku I looked into it a
> bit further. I was kind of shocked by the numbers.
>
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right, it takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table. That's a lot more than I
> realized.
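
For concreteness, here is the arithmetic above as a quick Python sketch;
the 300-rows-per-target-unit constant and the 8192-byte block size are the
stock values, and the one-block-per-sampled-row behaviour is the case Greg
describes for large tables:

```python
# Rough sketch of the numbers quoted above (not PostgreSQL source code).
BLOCK_SIZE = 8192              # default PostgreSQL block size, in bytes
ROWS_PER_TARGET = 300          # ANALYZE samples 300 * statistics_target rows

default_statistics_target = 100
sample_rows = ROWS_PER_TARGET * default_statistics_target   # 30,000 rows

# On a sufficiently large table the sampler touches roughly one block per
# sampled row, so the I/O works out to about one 8 kB block per row.
blocks_read = sample_rows
bytes_read = blocks_read * BLOCK_SIZE

print(f"rows sampled: {sample_rows:,}")
print(f"blocks read:  {blocks_read:,}")
print(f"data read:    {bytes_read / (1024 * 1024):.0f} MB")  # ~234 MB, i.e. the ~240MB above
```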

That is a lot. On the other hand, I question the subject line:
sometimes, our ANALYZE sampling is not good enough. Before we raised
the default statistics target from 10 to 100, complaints about bad
plan choices due to insufficiently-precise statistics were legion --
and we still have people periodically proposing to sample a fixed
percentage of the table instead of a fixed number of rows, even on
large tables, which goes in the opposite direction. I think this
is because they're getting really bad n_distinct estimates, and no
fixed-size sample can reliably give a good one.
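
To make the n_distinct point concrete, here is a purely illustrative
simulation (not the estimator ANALYZE actually uses, and the
ten-rows-per-value distribution is invented): a fixed 30,000-row sample
simply sees a smaller and smaller fraction of the distinct values as the
table grows, which is what makes any extrapolation from it fragile.

```python
# Illustrative only: a fixed-size sample sees ever fewer of the distinct
# values as the table grows, regardless of how an estimator extrapolates.
import random

SAMPLE_SIZE = 30_000

for table_rows in (100_000, 1_000_000, 10_000_000):
    n_distinct = table_rows // 10          # pretend every value occurs ~10 times
    sample = [random.randrange(n_distinct) for _ in range(SAMPLE_SIZE)]
    seen = len(set(sample))
    print(f"table rows {table_rows:>10,}: true n_distinct {n_distinct:>9,}, "
          f"distinct values in sample {seen:>6,}")
```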

More generally, I think the basic problem that people are trying to
solve by raising the statistics target is the planner avoiding index
scans on gigantic tables when it should be using them. Obviously, there
are a lot of other situations where
inadequate statistics can cause problems, but that's a pretty
easy-to-understand one that we do not always get right. We know that
an equality search looking for some_column = 'some_constant', where
some_constant is not an MCV, must be more selective than a search for
the least-frequent MCV. If you store more and more MCVs for a table,
eventually you'll have enough that the least-frequent one is pretty
infrequent, and then things work a lot better.
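
Here is a sketch of that bound; the Zipf-like frequency distribution is
invented purely for illustration, and this is the reasoning paraphrased in
Python, not planner code. The estimate for a constant that is not in the
MCV list is bounded by the least-frequent MCV's frequency, so storing more
MCVs tightens the bound:

```python
# Hypothetical value-frequency distribution, most common value first.
def zipf_frequencies(n_values):
    weights = [1.0 / rank for rank in range(1, n_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_frequencies(10_000)

for n_mcvs in (10, 100, 400):
    least_frequent_mcv = freqs[n_mcvs - 1]
    print(f"{n_mcvs:>3} MCVs kept -> a non-MCV estimate is capped at "
          f"frequency {least_frequent_mcv:.6f}")
```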

This is more of a problem for big tables than for small tables. MCV
#100 can't have a frequency greater than 1/100 = 0.01, but that's a
lot more rows on a big table than on a small one. On a table with 10
million rows we might estimate something close to 100,000 rows when
the real number is just a handful; when the table has only 10,000
rows, we just can't be off by as many orders of magnitude. Things
don't always work out that badly, but in the worst case they do.
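
The same ceiling spelled out in rows, just restating the arithmetic above:

```python
# MCV #100's frequency ceiling of 1/100, translated into row counts.
FREQ_CEILING = 1 / 100

for table_rows in (10_000_000, 10_000):
    worst_case_rows = int(table_rows * FREQ_CEILING)
    print(f"{table_rows:>11,}-row table: estimate for a rare value can be as "
          f"high as {worst_case_rows:,} rows")
```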

Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not. So what I
think we should do is auto-tune the statistics target based on the
table size. If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
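
One hypothetical shape such a formula could take; the log-scale
interpolation and the 10,000-row and 1,000,000,000-row breakpoints are
invented for illustration, and only the 10-to-400 range comes from the
paragraph above:

```python
# Hypothetical auto-tuning formula: interpolate the statistics target on a
# log scale of table size, clamped to the 10..400 range suggested above.
# The small_table and big_table breakpoints are invented.
import math

def auto_statistics_target(table_rows, min_target=10, max_target=400,
                           small_table=10_000, big_table=1_000_000_000):
    if table_rows <= small_table:
        return min_target
    if table_rows >= big_table:
        return max_target
    frac = ((math.log(table_rows) - math.log(small_table))
            / (math.log(big_table) - math.log(small_table)))
    return round(min_target + frac * (max_target - min_target))

for rows in (1_000, 100_000, 10_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> statistics target {auto_statistics_target(rows)}")
```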

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
