Re: new correlation metric

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org, npboley(at)gmail(dot)com
Subject: Re: new correlation metric
Date: 2008-10-26 16:44:06
Message-ID: 11984.1225039446@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> I think the code is in the right direction, but I think want you want
> is some kind of estimate of "given I've looked for tuple X, how many
> tuples in the next k pages are near this one". Unfortunatly I don't see
> a way of calculating it other than a full simulation.

I wonder whether we ought to rethink the problem entirely. In
particular, the notion of associating correlation stats with particular
columns doesn't seem amazingly useful when you get right down to it.
We're wasting our time calculating the correlation of a column that has
no index; and if the column is part of a multicolumn index it's not that
easy to figure out what the index correlation is from the per-column
numbers. Not to mention functional and partial indexes.

So it occurs to me that maybe we should forget about per-column
correlations altogether, and instead try directly to calculate *per
index* correlations. You could do this now by doing an index-only scan
and looking at the series of tuple block numbers that come back.
However, there's no obvious way to do any sampling in that approach
--- you can't read random subsets of the index without cooperation from
the index AM.

So the direction I'm headed in is to imagine that we should add an
"analyze" entry point to the index AM API, with the definition that it
will be called once per index per ANALYZE, and is in charge of putting
useful stats into someplace or other. We might need to invent some
other catalog besides pg_statistic if we want to represent per-index
properties like correlation. A minimal solution would be to add a
correlation column to pg_class or pg_index, but that doesn't scale well
if you imagine that different index AMs might want different sorts of
stats.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2008-10-26 18:32:17 array_agg and array_accum (patch)
Previous Message Martin Gainty 2008-10-26 16:05:54 Re: again...