Re: Multi-pass planner

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: decibel <decibel(at)decibel(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multi-pass planner
Date: 2013-04-19 21:19:46
Message-ID: CAMkU=1y13w5oYaJ_mwJ-dPpZAugOUTWfffgAKRa=5Po5gbznCA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 3, 2013 at 6:40 PM, Greg Stark <stark(at)mit(dot)edu> wrote:

>
> On Fri, Aug 21, 2009 at 6:54 PM, decibel <decibel(at)decibel(dot)org> wrote:
>
>> Would it? Risk seems like it would just be something along the lines of
>> the high-end of our estimate. I don't think confidence should be that hard
>> either. IE: hard-coded guesses have a low confidence. Something pulled
>> right out of most_common_vals has a high confidence.
>>
>
I wouldn't be so sure of that. I've run into cases where all of the
frequencies pulled out of most_common_vals are off by orders of magnitude.
The problem is that if ANALYZE samples only 1/1000th of the table and it
sees a value twice, it assumes the value is present 2000 times in the
table, even when it actually occurs only twice. Now, for any given value
that occurs twice in the table, it is very unlikely that both instances
end up in the sample. But when you have millions of distinct values which
each occur twice (or some other low number of times), it is a near
certainty that several of them will end up with both instances in the
sample. The few that get "lucky" are of course going to end up in the
most_common_vals list.
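
To make that concrete, here is a minimal simulation in Python (not
ANALYZE itself; the table and sample sizes are made up for illustration)
showing how a handful of twice-occurring values reliably land in the
sample twice and then get scaled up by the sampling factor:

    import random
    from collections import Counter

    # Illustrative numbers: 20 million rows holding 10 million distinct
    # values, each appearing exactly twice, sampled at 1/1000th.
    N_ROWS = 20_000_000
    SAMPLE = N_ROWS // 1000          # 20,000 sampled rows

    random.seed(42)
    sampled_rows = random.sample(range(N_ROWS), SAMPLE)

    # Row i holds value i // 2, so every value occupies exactly two rows.
    counts = Counter(row // 2 for row in sampled_rows)
    lucky = [v for v, c in counts.items() if c == 2]

    # Each "lucky" value is seen twice in the sample, so a naive scale-up
    # estimates 2 * 1000 = 2000 occurrences -- off by a factor of 1000.
    print(f"values seen twice in sample: {len(lucky)}")   # typically ~10
    print(f"estimated rows per lucky value: {2 * 1000}, actual: 2")

With these numbers, roughly ten values land in the sample twice, and each
has its frequency overestimated by a factor of 1000.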

Since the hash join cost estimate depends on the frequency of the most
common value, having that frequency be systematically off by a factor of
1000 is rather unfortunate.

The problem here is that the sample size which is adequate for getting a
good estimate of the histograms (which is what controls the sample size
currently) is not adequate for getting a good estimate of most_common_vals.
Cranking up the statistics target would give better estimates of
most_common_vals, but at the expense of a needlessly large histogram,
which slows down planning. There is currently no knob to crank up the
sample size for the sake of the most common values and then prune the
histogram back down for storage.
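
For reference, the coupling comes from a single knob: if I read
std_typanalyze() in analyze.c right, ANALYZE samples 300 * stats_target
rows, and the same target caps both the histogram and the
most_common_vals list. A back-of-the-envelope sketch of that arithmetic:

    # Rough sketch of how one target drives sample size, histogram,
    # and MCV list (constants as I understand std_typanalyze()).
    def analyze_footprint(stats_target):
        sample_rows = 300 * stats_target   # rows ANALYZE reads
        histogram_bounds = stats_target    # histogram buckets stored
        mcv_entries = stats_target         # max most_common_vals entries
        return sample_rows, histogram_bounds, mcv_entries

    for target in (100, 1000, 10000):
        rows, hist, mcv = analyze_footprint(target)
        print(f"target={target}: sample={rows} rows, "
              f"histogram={hist} buckets, mcv<={mcv} entries")

So getting a 10x larger sample for the MCV list means storing (and
planning against) a 10x larger histogram as well.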

Cheers,

Jeff
