Re: limit clause breaks query planner?

From: "Scott Carey" <scott(at)richrelevance(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Guillaume Cottenceau" <gc(at)mnc(dot)ch>, "Matthew Wakeling" <matthew(at)flymine(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: limit clause breaks query planner?
Date: 2008-09-04 18:04:29
Message-ID: a1ec7d000809041104j554bc9d9i641ce1a594108a42@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, Sep 4, 2008 at 10:14 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
> Actually, an even easier hack (which would have the nice property of not
> needing to know the exact value being searched for), would simply use
> the existing cost estimates if the WHERE variables have low correlation
> (meaning the random-locations assumption is probably good), but apply
> some sort of penalty factor if the correlation is high. This would
> amount to assuming that the universe is perverse and high correlation
> will always mean that the rows we want are at the wrong end of the table
> not the right one. But any DBA will tell you that the universe is
> indeed perverse ;-)

As a user, I prefer this solution. For one, statistics get out of date. A
few updates/ inserts (maybe 3% of the table) can greatly
affect an individual value in the histogram, breaking the location
assumptions and create a worst case result.
But when 3% of rows change, the correlation cannot change as drastically as
the histogram in the worst case. When deciding how to go about using the
statistics for a query plan, its best to assume that they are not perfect,
and that there has been some change since they were last gathered. This is
but one thing that makes the universe perverse :)
The safe assumption is that if the distribution is not random, the expected
average number of rows to scan goes up -- this is statistically correct.
With perfect correlation and unknown locations the average expected value
number of scanned rows would be ~ half the table. Equal likelihood of the
first 15 rows as the last 15. Thus, for perfect correlation the average
penalty would be half the table scanned, and worst case would be the whole
table. In this case, if the statistics are out of date somewhat, the cost
estimate is not likely to be more than a factor of 2 off. If one were to
use the histogram, the cost estimate could be many orders of magnitude off.

I'm fairly sure the penalty function to use for correlation values can be
derived statistically -- or at least approximated well enough for a look up
table.
If the histogram is used, the odds of it being wrong or out of date have to
be taken into account since the penalty for being incorrect is potentially
very large -- its not a gradual increase in cost for a small error, it is a
big and uncertain increase. I see the query planner's main goal is to avoid
the worst outcomes more than finding the absolute best one. Better to
produce 90% optimized queries 100% of the time than make 100% perfect
plans 90% of the time and then 10% of the time produce very bad plans.
Your suggestion above would do the best job avoiding bad plans but could
miss squeezing out that last few % in rare cases that probably don't matter
that much.

>
> OTOH, since indexscans get a cost estimate reduction in the presence of
> high correlation, we're already biasing the choice in the direction of
> indexscans for high correlation. We may not need to do it twice.

Because the full scan is compared against more than just index scans
(potentially), it makes sense to adjust each accordingly and independently.
Additionally, the index scan cost reduction function and the full table scan
cost increase function as correlation changes may have very different,
nonlinear 'shapes' between 0 and 1.

>
> I don't recall whether the OP ever showed us his statistics for the
> table in question --- did it even appear to have high correlation?
>
> regards, tom lane
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Kiran Mukhyala 2008-09-04 18:21:52 inaccurate stats on large tables
Previous Message Tom Lane 2008-09-04 17:14:32 Re: limit clause breaks query planner?