
From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-19 21:39:19
Message-ID: 4D0E7B87.1080909@fuzzy.cz
Lists: pgsql-hackers

On 19.12.2010 21:21, Simon Riggs wrote:
> On Mon, 2010-12-13 at 10:38 -0500, Tom Lane wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> On Sun, Dec 12, 2010 at 9:16 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>>> The proposed solution is based on contingency tables, built for selected
>>>> groups of columns (not for each possible group). And the contingency
>>>> table gives you the ability to estimate the probabilities needed to
>>>> compute the selectivity. Or am I missing something?
>>
>>> Well, I'm not real familiar with contingency tables, but it seems like
>>> you could end up needing to store a huge amount of data to get any
>>> benefit out of it, in some cases.
>>
>> The reason that this wasn't done years ago is precisely that nobody's
>> figured out how to do it with a tolerable amount of stats data and a
>> tolerable amount of processing time (both at ANALYZE time and during
>> query planning). It's not hard to see what we'd ideally like to do;
>> it's getting from there to something useful in production that's hard.
>
> I think we have to face up to the fact that attempting to derive
> meaningful cross-column stats will require larger sample sizes.

Amen.

> If we collect everything we're going to have ~10^9 stats slots with
> default stats_target 100 and a 100 column table.
>
> We should disconnect sample size from histogram size, and we need to
> make the initial column pairings vastly fewer than all combinations.
> Manual specification seems like it will be required for the cases where
> we decide not to include it automatically, so it seems we'll need manual
> specification anyway. In that case, we should do manual specification
> first.

Well, not really. The more bins you have, the larger the sample you need
to get a representative picture of the data. So the histogram size and
sample size are naturally connected.

And there are some (mostly heuristic) rules to determine how large the
sample should be. E.g. when building a contingency table for a
chi-squared test, a common rule is that each cell should contain at
least 5 values. So the more bins you have, the larger the sample you need.
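
Just to illustrate the arithmetic, a back-of-the-envelope sketch of that
rule of thumb (the function and constants here are purely mine, not part
of any patch, and assume a roughly uniform distribution):

MIN_PER_CELL = 5   # common chi-squared heuristic: >= 5 expected rows per cell

def min_sample_size(bins_per_column, n_columns):
    """Smallest sample where every cell of the contingency table can hold
    MIN_PER_CELL rows on average (uniform-distribution assumption)."""
    cells = bins_per_column ** n_columns
    return cells * MIN_PER_CELL

print(min_sample_size(100, 1))   # 500    - one column with 100 bins
print(min_sample_size(100, 2))   # 50000  - two columns at the same resolution
print(min_sample_size(10, 2))    # 500    - two columns, coarser 10x10 table

So keeping the 1-D resolution in just two dimensions already means a
sample about 100x larger.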

I like the way Oracle does this - you can either let it decide the
proper sample size, or you can specify how large the sample should be
(what fraction of the table).

So the tricky part here is determining the number of bins in the
histogram. In the one-dimensional case, stats_target=100 effectively means
each bin contains about 1% of the data. So I guess we should use a similar
approach in the multi-dimensional case too, i.e. let the user specify
a desired precision and then derive the number of bins automatically
(which is a bit tricky because of the multiple dimensions).
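
To show what I mean by deriving the bins from a desired precision, another
illustrative sketch (again assuming uniform, independent data - the
function is just an example, not a proposed implementation):

def bins_per_dimension(target_fraction, ndims):
    """How many bins per dimension so that one multi-dimensional bin holds
    roughly target_fraction of the rows (uniformity assumption)."""
    total_bins = 1.0 / target_fraction       # e.g. 0.01 -> 100 bins in total
    return max(1, round(total_bins ** (1.0 / ndims)))

print(bins_per_dimension(0.01, 1))   # 100 - matches stats_target = 100
print(bins_per_dimension(0.01, 2))   # 10  - only ~10 bins per column in 2-D
print(bins_per_dimension(0.01, 3))   # 5   - and ~5 per column in 3-D

I.e. for a fixed precision the per-column resolution drops quite fast as
the number of dimensions grows, which is the tricky part I meant.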

But yeah, in general you're right - this will require larger samples,
more complex stats to collect, and some space to store the collected
stats ...

regards
Tomas
