Re: ANALYZE sampling is too good

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ANALYZE sampling is too good
Date: 2013-12-10 23:03:49
Message-ID: CAMkU=1xesmEJLmcgjpBUzR3VrqV0ZPqkcXCNvb=_eHqWK2V_vA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 9, 2013 at 2:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > My reading of the code is that if it is not in the MCV, then it is
> > assumed to have the average selectivity (about 1/n_distinct, but
> > deflating top and bottom for the MCV list). There is also a check that
> > it is less than the least common of the MCV, but I don't know why that
> > situation would ever prevail--that should always be higher or equal to
> > the average selectivity.
>
> I've never seen an n_distinct value of more than 5 digits, regardless
> of reality. Typically I've seen 20-50k, even if the real number is
> much higher.

I don't recall seeing an n_distinct literally above 100,000 in the wild,
but I've seen negative ones which, multiplied by reltuples, give values
well beyond that. In test cases it is easy enough to get values in the
millions by creating tables using floor(random()*$something).
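
(For anyone following along: a negative n_distinct is stored as the fraction
of rows that are distinct, so something like the query below--just a sketch,
using the test tables defined next--turns it back into an absolute estimate.)

-- negative n_distinct means "this fraction of reltuples is distinct"
select s.tablename, s.attname, s.n_distinct, c.reltuples,
       case when s.n_distinct < 0
            then -s.n_distinct * c.reltuples
            else s.n_distinct
       end as est_distinct
  from pg_stats s
  join pg_class c on c.relname = s.tablename
 where s.attname = 'floor';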

create table baz as select floor(random()*10000000), md5(random()::text)
from generate_series(1,100000000);
create table baz2 as select * from baz order by floor;
create table baz3 as select * from baz order by md5(floor::text);

baz is unclustered, baz2 is clustered with perfect correlation, and baz3 is
clustered but without correlation.

After analyzing all of them:
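
(a plain ANALYZE of each, in case anyone wants to reproduce this:)

analyze baz;
analyze baz2;
analyze baz3;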

select tablename, n_distinct, correlation
  from pg_stats
 where tablename like 'baz%' and attname = 'floor';

 tablename | n_distinct  | correlation
-----------+-------------+-------------
 baz       | 8.56006e+06 |  0.00497713
 baz2      |      333774 |           1
 baz3      |      361048 |  -0.0118147

So baz is pretty close, while the other two are way off. But as far as I can
tell, the n_distinct computation doesn't depend on the order of the rows,
only on the composition of the sample. So this can only mean that our
"random" sampling method is desperately non-random, right?

> But the n_distinct value is only for non-MCVs, so if we
> estimate the selectivity of column = 'rarevalue' to be
> (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces
> the estimate, and making the MCV list longer naturally makes mcvfrac
> bigger.

Ah, I see. By including more things in the MCV list, we implicitly crowd out
the rest of the space.
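
(Roughly, and ignoring the extra clamping the planner does, the shape of that
estimate can be read straight out of pg_stats; this is only a sketch, and it
assumes a positive n_distinct:)

-- (1 - nullfrac - mcvfrac) / n_distinct, per Robert's description above
select tablename, attname,
       (1 - null_frac
          - coalesce((select sum(f) from unnest(most_common_freqs) f), 0))
         / n_distinct as est_non_mcv_selectivity
  from pg_stats
 where tablename = 'baz' and attname = 'floor';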

> I'm not sure how important the
> less-frequent-than-the-least-common-MCV part is, but I'm very sure
> that raising the statistics target helps to solve the problem of
> overestimating the prevalence of uncommon values in a very big table.
>
> > I think that parts of the planner are N^2 in the size of histogram (or
> > was that the size of the MCV list?). So we would probably need a way to
> > use a larger sample size to get more accurate n_distinct and MCV
> > frequencies, but not save the entire histogram that goes with that
> > sample size.
>
> I think the saving the histogram part is important. As you say, the
> MCVs are important for a variety of planning purposes, such as hash
> joins. More than that, in my experience, people with large tables are
> typically very willing to spend more planning time to get a better
> plan, because mistakes are expensive and the queries are likely to run
> for a while anyway. People with small tables care about planning

> time, because it makes no sense to spend an extra 1ms planning a query
> unless you improve the plan by enough to save at least 1ms when
> executing it, and when the tables are small and access is expected to
> be fast anyway that's often not the case.
>

I would think that the dichotomy is more about the size of the query plan
than the size of the tables. I think a lot of people with huge tables end up
doing mostly indexed lookups in unique or highly selective indexes, once all
the planning is done.

Does anyone have generators for examples of cases where increasing the
sample size to get better histograms (as opposed to a more accurate
n_distinct or MCV list) was the key to fixing bad plans? I wonder whether it
is more bins, or more accurate bin boundaries, that makes the difference.
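
(For concreteness, the knob I have in mind is the per-column statistics
target, which scales the sample size, the MCV list, and the number of
histogram bins together:)

alter table baz alter column floor set statistics 1000;  -- default is 100
analyze baz;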

Cheers,

Jeff
