Re: benchmarking the query planner

From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: benchmarking the query planner
Date: 2008-12-12 12:14:36
Message-ID: 603c8f070812120414g2daba35egb10cc5d7cc2d7dee@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Robert Haas" <robertmhaas(at)gmail(dot)com> writes:
>> I had this idle thought too, but I didn't write it down because...
>
>>> ought to be, but it seems like it ought to be possible to determine
>>> that given a desired maximum error in the overall estimate. I'm also
>>> not very clear on what the "total frequency" computations (matchfreq2
>>> and unmatchfreq2 in the current code) ought to look like if we are using
>>> a variable subset of the inner list.
>
>> ...of this exact concern, which I think is an insurmountable problem.
>
> Maybe so. If we stick to the other design (end both lists at a preset
> frequency threshold) then the math clearly goes through the same as
> before, just with num_mcvs that are determined differently. But can
> we prove anything about the maximum error added from that?

I don't think so, because in that design, it's entirely possible that
you'll throw away the entire MCV list if all of the entries are below
the threshold (as in the example we were just benchmarking, supposing
a threshold of 0.001).

An alternative is to pick a threshold T for the maximum number of
equality probes that you're willing to suffer through. Then given two
MCV lists of lengths M1 and M2 such that M1 * M2 > T, pick the largest
N such that MIN(M1, N) * MIN(M2, N) <= T. This guarantees that you
always use at least T^(1/2) MCVs.
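
Concretely, picking N could look something like this (just a sketch to
show the arithmetic -- the function name and the idea of passing the
budget in directly are my own invention, not anything in the current
code):

    #include <math.h>

    /*
     * Hypothetical helper: given MCV list lengths m1 and m2 and a probe
     * budget t, return the largest n such that Min(m1, n) * Min(m2, n) <= t.
     * The caller has already checked that m1 * m2 > t; otherwise both
     * lists would simply be used whole.
     */
    static int
    clamp_mcv_lists(int m1, int m2, int t)
    {
        int     shorter = (m1 < m2) ? m1 : m2;
        int     root = (int) sqrt((double) t);

        /* guard against sqrt() rounding in either direction */
        while ((root + 1) * (root + 1) <= t)
            root++;
        while (root > 0 && root * root > t)
            root--;

        /*
         * If the shorter list already fits under t^(1/2), the longer one
         * can soak up the rest of the budget; otherwise clamp both sides
         * at t^(1/2).
         */
        if (shorter < root)
            return t / shorter;
        return root;
    }

With T = 10^5, two 1000-entry lists would both be clamped to 316 (316 *
316 = 99856 probes), while a 100-entry list joined against a 5000-entry
one would keep the short side whole and clamp the long side to 1000
entries (100 * 1000 = 10^5 probes).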

If you compare this approach with T = 10^6 vs. simply chopping off the
MCV list at p = 0.001, this approach will be more accurate at the cost
of more comparisons. For example, in our test case where all the
comparisons fail, chopping off the MCV list at p = 0.001 results in
ignoring it completely, whereas with this approach we use all 1000
entries just as before. So it might be appropriate to choose a lower
threshold like, say, T = 10^5, since otherwise we're not preventing
any computation. (I suppose you could even make this a GUC since any
choice of value is going to be pretty arbitrary...)

I'm not sure to what extent we can bound the amount of error with this
approach, but it's definitely better. As we've seen, a frequency
cutoff can throw away the entire MCV list; this approach always
retains at least T^(1/2) entries, and more if the other list happens
to be shorter than T^(1/2). So we can say that the result will never
be worse than it would have been had you set the statistics target to
T^(1/2).
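
To put rough numbers on that: with T = 10^5 the floor is 316 entries per
side, and with T = 10^6 it's 1000, i.e. the full MCV list in our test
case.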

...Robert
