Re: LIMIT confuses the planner

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: kouber(at)saparev(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: LIMIT confuses the planner
Date: 2009-03-22 21:29:16
Message-ID: 17814.1237757356@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Kouber Saparev <kouber(at)saparev(dot)com> writes:
> Tom Lane wrote:
>> Hmph, that's still not real good. Ideally it should be estimating
>> *less* than the average frequency, because the estimate is made after
>> excluding all the most-common-values, which evidently 'kouber' is not
>> one of.

> I altered the statistics for that column to 1000, so now the planner
> assumes exactly 492 rows for the fore-mentioned query, which is indeed
> the average. It never went *less* than that value, it was always higher,
> i.e. for a statistics value of 600, it was 588, for 800, it became 540.

I got some time to think more about this and experiment a bit further.
As far as I can tell there is no fundamental bug here --- given
reasonably accurate stats the rowcount estimate behaves as expected, ie,
you get an estimate that's less than the actual average number of values
if the target value is not one of the known MCVs. However, as the
n_distinct estimate falls below the actual number of distinct values,
that rowcount estimate necessarily rises. What had surprised me about
this report is that the estimate matched the true average number of rows
so closely; I wondered if there was some property of the way we estimate
n_distinct that would make that happen. But I now think that that was
just chance: there doesn't seem to be any underlying behavior that would
cause it. I did some experiments with random data matching a Zipfian
distribution (1/k law) and did not observe that the rowcount estimate
converged to the true average when the n_distinct value was too low.

So the bottom line here is just that the estimated n_distinct is too
low. We've seen before that the equation we use tends to do that more
often than not. I doubt that consistently erring on the high side would
be better though :-(. Estimating n_distinct from a limited sample of
the population is known to be a statistically hard problem, so we'll
probably not ever have perfect answers, but doing better is on the
to-do list.

regards, tom lane

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Laurent Wandrebeck 2009-03-22 23:14:52 Re: "iowait" bug?
Previous Message Greg Smith 2009-03-22 20:17:40 Re: "iowait" bug?