Re: ANALYZE sampling is too good

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-09 21:18:59
Message-ID: CAMkU=1yVmSVjSgw0qL9imbabLa7_kUYQxu9ZcdO1o357kq=_Zw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> > I always gave the party line that ANALYZE only takes a small
> > constant-sized sample so even very large tables should be very quick.
> > But after hearing the same story again in Heroku I looked into it a
> > bit further. I was kind of shocked by the numbers.
> >
> > ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> > pretty reasonable but with default_statistics_target set to 100 that's
> > 30,000 rows. If I'm reading the code right, it takes this sample by
> > sampling 30,000 blocks and then (if the table is large enough) taking
> > an average of one row per block. Each block is 8192 bytes so that
> > means it's reading 240MB of each table. That's a lot more than I
> > realized.
>
> That is a lot. On the other hand, I question the subject line:
> sometimes, our ANALYZE sampling is not good enough. Before we raised
> the default statistics target from 10 to 100, complaints about bad
> plan choices due to insufficiently-precise statistics were legion --
> and we still have people periodically proposing to sample a fixed
> percentage of the table instead of a fixed amount of the table, even
> on large tables, which is going the opposite direction. I think this
> is because they're getting really bad n_distinct estimates, and no
> fixed-size sample can reliably give a good one.
>
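
Spelling out that arithmetic (assuming the default 8 kB block size and
default_statistics_target = 100), the worst case works out to roughly:

    select 300 * 100                  as sample_rows,
           300 * 100 * 8192 / 1048576 as approx_mib_read;
    -- sample_rows | approx_mib_read
    --       30000 |             234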

I don't recall ever tracing a bad plan down to a bad n_distinct. I have
seen several that were due to bad frequency estimates in the MCV list,
because hash join planning is extremely sensitive to that. Do we have some
kind of catalog of generators of problematic data, so that changes can be
tested on known problem sets? Perhaps a wiki page to accumulate them would
be useful. For automated testing I guess the generator and query are the
easy part; the hard part is the cost settings/caching/RAM needed to trigger
the problem, and parsing and interpreting the results.
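
As a strawman for the kind of generator I mean (the table names, sizes,
and skew below are all made up, just to illustrate the shape of it):

    -- Both join keys share one very common value, so the estimated join
    -- size hinges on the product of the two MCV frequency estimates;
    -- small errors in either one multiply.
    create table a as
        select case when random() < 0.2 then 0
                    else (random() * 1000000)::int end as key,
               g as a_payload
        from generate_series(1, 1000000) g;
    create table b as
        select case when random() < 0.2 then 0
                    else (random() * 1000000)::int end as key,
               g as b_payload
        from generate_series(1, 1000000) g;
    analyze a;
    analyze b;
    -- Compare the estimated rows here to the actual join size.
    explain select count(*) from a join b using (key);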

>
> More generally, I think the basic problem that people are trying to
> solve by raising the statistics target is to avoid index scans on
> gigantic tables. Obviously, there are a lot of other situations where
> inadequate statistics can cause problems, but that's a pretty
> easy-to-understand one that we do not always get right. We know that
> an equality search looking for some_column = 'some_constant', where
> some_constant is not an MCV, must be more selective than a search for the
> least-frequent MCV. If you store more and more MCVs for a table,
> eventually you'll have enough that the least-frequent one is pretty
> infrequent, and then things work a lot better.
>

My reading of the code is that if it is not in the MCV list, then it is
assumed to have the average selectivity (about 1/n_distinct, but deflating
top and bottom for the MCV list). There is also a check that it is less
than the least common of the MCVs, but I don't know why that situation
would ever prevail--that should always be higher than or equal to the
average selectivity.
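
In pg_stats terms, that average works out to roughly the following
(hypothetical table t and column c; this also assumes n_distinct is
stored as a positive count rather than a negative fraction of the row
count, and that the column actually has an MCV list):

    select n_distinct,
           most_common_vals,
           most_common_freqs,
           -- spread the non-null, non-MCV mass evenly across the
           -- distinct values that didn't make the MCV list
           (1 - null_frac
              - (select sum(f) from unnest(most_common_freqs) f))
             / (n_distinct - array_length(most_common_freqs, 1))
               as approx_selectivity_of_a_non_mcv_value
    from pg_stats
    where tablename = 't' and attname = 'c';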

>
> This is more of a problem for big tables than for small tables. MCV
> #100 can't have a frequency of greater than 1/100 = 0.01, but that's a
> lot more rows on a big table than on a small one. On a table with 10
> million rows we might estimate something close to 100,000 rows when
> the real number is just a handful; when the table has only 10,000
> rows, we just can't be off by as many orders of magnitude. Things
> don't always work out that badly, but in the worst case they do.
>
> Maybe there's some highly-principled statistical approach which could
> be taken here, and if so that's fine, but I suspect not. So what I
> think we should do is auto-tune the statistics target based on the
> table size. If, say, we think that the generally useful range for the
> statistics target is something like 10 to 400, then let's come up with
> a formula based on table size that outputs 10 for small tables, 400
> for really big tables, and intermediate values for tables in the
> middle.
>
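
For what it's worth, one strawman shape for such a formula (the cube
root and the constant 4 here are made up, just to show something that
grows with table size and clamps to the 10..400 range) would be:

    select relname, relpages,
           greatest(10, least(400, round(4 * cbrt(relpages))))::int
             as auto_statistics_target
    from pg_class
    where relkind = 'r'
    order by relpages desc
    limit 10;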

I think that parts of the planner are N^2 in the size of the histogram (or was
that the size of the MCV list?). So we would probably need a way to use a
larger sample size to get more accurate n_distinct and MCV frequencies, but
not save the entire histogram that goes with that sample size.
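
The only knob we have today ties the two together (hypothetical table
and column names):

    -- Raising the per-column target enlarges both the sample ANALYZE
    -- takes (300 * target rows) and the MCV list/histogram it stores;
    -- there is currently no way to ask for the bigger sample without
    -- also keeping the bigger histogram.
    alter table t alter column c set statistics 1000;
    analyze t;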

Cheers,

Jeff
