Re: benchmarking the query planner

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd\(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: benchmarking the query planner
Date: 2008-12-12 16:10:13
Message-ID: 87tz99s7bu.fsf@oxford.xeocode.com
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote:
>> It is a pretty well-known mathematical fact that for something like an
>> opinion poll your margin of error does not depend on the size of the
>> population but only on the size of your sample.
>
> Yes, I agree with that *but* that refers to population statistics and
> has nothing at all to do with the calculation of ndistinct, which is
> what we were discussing. You can't just switch topics and have the
> statement remain true.

If you go back to my email, that was kind of my point. The existing sample size
is on a solid foundation for the histograms and most use cases for the
statistics, but it is entirely bogus for ndistinct.

The ndistinct estimate is just piggy-backing on that data. However, to fix it
would require switching over to scanning a percentage of the whole table, which
would be a massive increase in work for that one calculation. You can't fix it
by just adjusting the sample size slightly.
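
To put a rough number on that, here's a throwaway Python sketch -- nothing to
do with the server code; the column shape and the naive "distinct values seen
in the sample" count standing in for the real estimator are just illustrative
assumptions:

    # Rough, self-contained sketch -- not PostgreSQL's code.  The column
    # shape, the 3000-row sample (300 * default_statistics_target of 10),
    # and the naive distinct count are all illustrative assumptions.
    import random
    import statistics

    def simulate(table_rows, sample_rows, value_range):
        # A column with roughly value_range distinct values, uniform.
        column = [random.randint(1, value_range) for _ in range(table_rows)]
        sample = random.sample(column, sample_rows)

        # Order statistics: the sample median tracks the true median, and
        # growing the table doesn't change that -- only sample size matters.
        print("median     true=%9d  from sample=%9d"
              % (statistics.median(column), statistics.median(sample)))

        # ndistinct: whatever estimator you run over the sample, it only
        # ever sees at most sample_rows distinct values.
        print("ndistinct  true=%9d  seen in sample=%9d"
              % (len(set(column)), len(set(sample))))

    random.seed(0)
    simulate(table_rows=1_000_000, sample_rows=3000, value_range=200_000)

The sample median comes out close to the true one, while the sampled distinct
count can never exceed 3000, no matter how many distinct values the column
actually has.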

> Note that the estimator we use was shown to be stable in the range of
> sample size between 5-20%.
> http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf

Uhm, this is a survey of lots of different methods and does lots of analysis.
I don't see any simple conclusions about stability. Perhaps I'm just missing
it in the technical details. Could you point out exactly what part of the
paper you're basing this on and what "stable" means?

> We currently use a sample size of 300*stats_target. With default=10 that
> means our sample size is 0.3% on a 1 million row table, and 0.003% on a
> 100 million row table (they're common, I find).
>
> That size of sample is OK for some kinds of statistics, but not much
> good for ndistinct estimation.

Right, but increasing our sample size by a factor of 150 for a 100M row table
doesn't seem like a reasonable solution to one metric being bogus.

For that matter, if we do consider sampling 5% of the table, we may as well
just go ahead and scan the whole table. It wouldn't take much longer and it
would actually produce good estimates.
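
For reference, the arithmetic behind those percentages -- just restating the
figures quoted above, assuming the default statistics target of 10:

    # Back-of-the-envelope numbers from the figures quoted in this thread.
    stats_target = 10
    sample_rows = 300 * stats_target          # 3,000 rows sampled by ANALYZE

    for table_rows in (1_000_000, 100_000_000):
        print("%11d rows: sample is %.3f%% of the table"
              % (table_rows, 100.0 * sample_rows / table_rows))

    # versus sampling 5% of the 100M-row table:
    print("5%% of 100M rows = %d rows to read" % (0.05 * 100_000_000))

That is 0.3% and 0.003% respectively, against 5,000,000 rows read for a 5%
sample of the large table.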

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
