Re: Does "correlation" mislead the optimizer on large

From: Ron Mayer <ron(at)intervideo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Does "correlation" mislead the optimizer on large
Date: 2003-01-24 19:36:50
Message-ID: Pine.LNX.4.44.0301241123310.986-100000@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


On Fri, 24 Jan 2003, Tom Lane wrote:
>
> Ron Mayer <ron(at)intervideo(dot)com> writes:
> > On a large tables, I think the "correlation" pg_stats field as calculated
> > by "vacuum analyze" or "analyze" can mislead the optimizer.
>
> If you look in the pghackers archives, you will find some discussion
> about changing the equation that cost_index() uses to estimate the
> impact of correlation on indexscan cost. The existing equation is
> ad-hoc and surely wrong, but so far no one's proposed a replacement
> that can be justified any better. If you've got such a replacement
> then we're all ears...

I've got a very slow one (full table scan perl script) that helps
my database... I don't know if it's a good general purpose solution.

That's why I'm asking if the concept is good here. :-)

> > In particular, if I have a large table t with columns 'a','b','c', etc,
> > and I cluster the table as follows:
> > create table t_ordered as select * from t order by a,b;
> > vacuum analyze t_ordered;
> > Column "b" will (correctly) get a very low "correlation" in
> > the pg_stats table -- but I think the optimizer would do better
> > assuming a high correlation because similar 'b' values are still
> > grouped closely on the same disk pages.
>
> How would that be? They'll be separated by the stride of 'a'.

In the case of date/time (for the queries I showed) the issue was
that 'a's were not at all unique so I had data like this:

dat | time | value
------------|----------|--------------------------------
2002-01-01 | 00:00:00 | whatever
2002-01-01 | 00:00:00 |
2002-01-01 | 00:00:00 |
2002-01-01 | 00:00:01 |
2002-01-01 | 00:00:01 | [many pages of 12am]
2002-01-01 | 00:00:01 |
2002-01-01 | 00:00:01 |
... thousands more rows....
2002-01-01 | 00:00:59 |
2002-01-01 | 00:01:00 | [many pages of 1am]
... tens of thousands of rows.
2002-01-01 | 23:59:59 |
2002-01-01 | 23:59:59 |
2002-01-01 | 23:59:59 | [many pages of 11pm]
2002-01-02 | 00:00:00 | [many *MORE* pages of 12am]
2002-01-02 | 00:00:00 |
2002-01-02 | 00:00:00 |
... tens of thousands of rows...
2002-01-02 | 23:59:59 | [many pages of 11pm]
2002-01-03 | 00:00:00 | [many *MORE* pages of 12am]
... millions more rows ...

A similar problem actually shows up again in the dimention tables
of my database; where I bulk load many pages at a time (which can
easily be ordered to give a good correlation for a single load) ...
but then the next week's data gets appended to the end.

id | value
------|----------------------------------
1 | aalok mehta [many pages of all 'a's]
2 | aamir khan
3 | aaron beall
| [...]
6234 | axel rose
6234 | austin wolf
6123 | barbara boxer [many pages of all 'b's]
| [...]
123456 | young
123457 | zebra
| [...data loaded later..]
123458 | aaron whatever [more pages of all 'a's]
123458 | aaron something else
123458 | aaron something else
| [...]
512344 | zelany

In this case I get many clustered blocks of "a" values, but these
clustered blocks happen at many different times across the table.

> It seems likely to me that a one-dimensional correlation statistic may
> be inadequate, but I haven't seen any proposals for better stats.

The idea is it walks the whole table and looks for more local
correlations and replaces the correlation value with a "good"
value if values "close" to each other on the disk are similar.

This way a single "correlation" value still works ... so I didn't
have to change the optimizer logic, just the "analyze" logic.

Basically if data within each block is highly correlated, it doesn't
matter as much (yeah, I now the issue about sequential reads vs. random
reads).

Ron

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Ron Mayer 2003-01-24 20:04:12 Re: Does "correlation" mislead the optimizer on large
Previous Message Josh Berkus 2003-01-24 19:20:14 Mount options for Ext3?