Re: Optimizer improvements: to do or not to do?

From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Optimizer improvements: to do or not to do?
Date: 2006-09-13 17:16:56
Message-ID: 45083D08.8040309@cheapcomplexdevices.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Simon Riggs wrote:
> On Mon, 2006-09-11 at 06:20 -0700, Say42 wrote:
>> That's what I want to do:
>> 1. Replace not very useful indexCorrelation with indexClustering.
>
> An opinion such as "not very useful" isn't considered sufficient
> explanation or justification for a change around here.

"Not sufficient for some types of data" would have been more fair.

I speculate that an new additional stat of
"average # of unique values for a column within a block"
would go a long way to helping my worst queries.

It's common here for queries to vastly overestimate the
number of pages that would need to be read because
postgresql's guess at the correlation being practically 0
despite the fact that the distinct values for any given
column are closely packed on a few pages.

Our biggest tables (180G or so) are mostly spatial data with columns
like "City" "State" "Zip" "County" "Street" "School District", "Police
Beat", "lat/long" etc; and we cluster the table on zip,street.

Note that practically all the rows for any single value of any
of the columns will lay in the same few blocks. However the
calculated "correlation" being low because the total ordering
of the other values doesn't match that of zip codes. This
makes the optimizer vastly overestimate the cost of index
scans because it guesses that most of the table will need
to be read, even though in reality just a few pages are needed.

If someone does look at the correlation calculations, I hope
this type of data gets considered as well.

I speculate that a new stat of
"average # of unique values for a column within a block"
could be useful here in addition to correlation. For most
all my columns in my big table, this stat would be 1 or 2;
which I think would be a useful hint that despite a low
"correlation", the distinct values are indeed packed together
in blocks. That way the optimizer can see that a
smaller number of pages would need to be accessed than
correlation alone would suggest.

Does this make sense, or am I missing something.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2006-09-13 17:30:20 Re: CVS commit messages and backpatching
Previous Message Bruce Momjian 2006-09-13 17:03:30 CVS commit messages and backpatching