Re: Planner mis-estimation using nested loops followup

From: "Stephen Denne" <Stephen(dot)Denne(at)datamail(dot)co(dot)nz>
To: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>, "Chris Kratz" <chris(dot)kratz(at)vistashare(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Planner mis-estimation using nested loops followup
Date: 2008-03-18 21:26:58
Message-ID: F0238EBA67824444BC1CB4700960CB4804EAC54F@dmpeints002.isotach.com
Lists: pgsql-performance

Scott Marlowe wrote
> On Tue, Mar 18, 2008 at 9:58 AM, Chris Kratz
> <chris(dot)kratz(at)vistashare(dot)com> wrote:
> > Y, turning nested loops off in specific cases has increased performance
> > greatly. It didn't fix the planner mis-estimation, just the plan it chose.
> > It's certainly not a panacea, but it's something we now try early on when
> > trying to speed up a query that matches these characteristics.
>
> I have to admit I've had one or two reporting queries in the past that
> turning off nested_loop was the only reasonable fix due to
> misestimation. I'd tried changing the stats targets etc and nothing
> really worked reliably to prevent the nested_loop from showing up in
> the wrong places.

One cause of planner mis-estimation I've seen quite frequently is when there are a number of predicates on the data that filter the results in roughly the same manner. PostgreSQL, not knowing that the filters are highly correlated, multiplies the "fraction of selected rows" together.

Making up an example, suppose something like this is one of the subqueries (taking "recent" arbitrarily as the last week):

select * from orders
where order_date > current_date - 7
and order_fulfilled is false;

Suppose it is used in an application where the unfulfilled orders are generally the recent ones.

If PostgreSQL estimates that 1% of the orders are recent, and 1% are unfulfilled, it will assume that 0.01% of them are both recent and unfulfilled. If in reality the figure is more like 0.9%, your actual row count will be 90 times the estimate.
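
As a rough sketch (assuming the hypothetical orders table above, with a boolean order_fulfilled column), one way to see how far off the independence assumption is would be to compare the individual and combined selectivities directly:

-- Fraction of rows matching each predicate, and both together.
-- If frac_both is much larger than frac_recent * frac_unfulfilled,
-- the planner's independence assumption under-estimates the row count.
select
    avg((order_date > current_date - 7)::int)  as frac_recent,
    avg((order_fulfilled is false)::int)       as frac_unfulfilled,
    avg((order_date > current_date - 7
         and order_fulfilled is false)::int)   as frac_both
from orders;

On a table of 1,000,000 orders, the planner's estimate would be 1,000,000 * 1% * 1% = 100 rows, while the real count would be roughly 1,000,000 * 0.9% = 9,000.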

The only kind of simple behind-the-scenes fix for these situations that I know of is to add more indexes (such as a partial index on order_date where order_fulfilled is false), which slows down all your updates, and only works for the simplest situations.
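
For instance, the partial index mentioned above might look something like this (again assuming a boolean order_fulfilled column; the index name is just illustrative):

-- Indexes only the unfulfilled orders, so it stays small and matches
-- the "recent and unfulfilled" query pattern directly.
create index orders_unfulfilled_order_date_idx
    on orders (order_date)
    where order_fulfilled is false;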

A general fix would need to calculate, store, and look up a huge amount of correlation data, probably on the order of the square of the number of rows in pg_stats, though it could possibly be generated as needed.

Perhaps the ANALYZE command could be extended to accept something like:
ANALYZE CARTESIAN CORRELATION orders(order_date, order_fulfilled);
which would store the fraction of rows for each combination of most frequent values and histogram buckets from order_date and order_fulfilled.
The difficulty is whether the planner could quickly and easily determine that appropriate correlation data exists for the query plan it is estimating.

Regards,
Stephen Denne.

