Re: proposal : cross-column stats

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-12 14:17:36
Message-ID: 20101212141735.GA6602@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Hi everyone,
>
> one of the ssesion I've attended on PgDay last week was Heikki's session
> about statistics in PostgreSQL. One of the issues he mentioned (and one
> I regularly run into) is the absence of cross-column stats. When the
> columns are not independent, this usually result in poor estimates (and
> then in suboptimal plans).

Very cool that you're working on this.

> Lets talk about one special case - I'll explain how the proposed
> solution works, and then I'll explain how to make it more general, what
> improvements are possible, what issues are there. Anyway this is by no
> means a perfect or complete solution - it's just a starting point.

It looks like you handled most of the issues. Just a few points:

- This is obviously applicable to more than just integers, probably
anything with a b-tree operator class. What you've coded seems rely
on calculations on the values. Have you thought about how it could
work for, for example, strings?

The classic failure case has always been: postcodes and city names.
Strongly correlated, but in a way that the computer can't easily see.
Not that I suggest you fix this, but it's food for though. Though
strictly speaking this is a different kind of correlation than what
you're looking at.

> 2) I really don't think we should collect stats for all combinations of
> columns of a table - I do like the Oracle-like approach where a DBA
> has to enable cross-column stats using an ALTER TABLE (for a
> particular list of columns).
>
> The only exception might be columns from a multi-column index. It
> might be quite efficient I guess?

In the past it has been suggested to only do it for multi-column
indexes, but I find these days I find in some situations I prefer to
make individual indexes and let the bitmap scan code combine them. So
perhaps it would be best to let it be configured by the DBA.

> 3) There are independence tests for contingency tables (e.g. Pearson's
> Chi-squared test), so that it's easy to find out whether the columns
> are independent. In that case we can just throw away these stats and
> use the simple estimation.
>
> http://mathworld.wolfram.com/Chi-SquaredTest.html

I think this would be good to include, if possible.

Actually, I wonder if the existing stats collection code could be
altered to attempt to calculate the correlation between columns as part
of its other work.

> 4) Or we could store just those cells where expected and observed values
> differ significantly (may help if most of the values are indendent,
> but there's a small glitch somewhere).

Comrpessing that grid would be useful, given that for many dimensions
most of the grid will be not interesting. In fact, storing the 20
largest values may be enough. Worth an experiment.

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2010-12-12 14:32:01 Re: Extensions, patch v16
Previous Message Pavel Stehule 2010-12-12 12:14:04 Re: proposal: auxiliary functions for record type