
From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-24 13:50:39
Message-ID: 4D14A52F.8040302@fuzzy.cz
Lists: pgsql-hackers

On 24.12.2010 04:41, Florian Pflug wrote:
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(

Using the statistics target is a good idea, I think. We could use it to
determine the error rate of the filter. Something like

error rate = 10 - (statistics_target - 100) / 100

which gives

1% for statistics target 1000
10% for statistics target 100

or maybe something like this (where the error rate grows faster for
smaller statistics target values)

error rate = 1 + 91000 / (statistics_target^2)

which gives about

1% for statistics target 1000
10% for statistics target 100
36% for statistics target 50

But I guess a 10% error rate is the minimum we need, so it does not make
much sense to use lower values.

>> > Another possibility is to collect the data from just a small portion
>> > of a table and then use the result to estimate the number of distinct
>> > values for the whole table. But I'm not sure we can do this reliably,
>> > I see many traps in this.
> This is how it works currently. The problem with this approach is that
> it gives you very little guarantees about how precise the result will be.
> Extrapolating works very well for things like MCVs and histograms, because
> there you're by definition interested mostly in values which occur often -
> and thus with a high probability in the relative few rows you sample. For
> the number of distinct values, however, this isn't true - if ndistinct
> is an order of magnitude smaller than the number of rows, relatively few
> rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(
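Just to illustrate why sampling a small portion is unreliable for
ndistinct, here is a toy simulation (my own example, with uniformly
distributed values, which is actually the friendly case):

```python
import random

random.seed(42)

# A hypothetical table of 1,000,000 rows with 100,000 distinct values.
n_rows = 1_000_000
n_distinct = 100_000
table = [random.randrange(n_distinct) for _ in range(n_rows)]

# A 1% sample sees only a small fraction of the distinct values,
# so any estimator has to extrapolate very aggressively.
sample = random.sample(table, n_rows // 100)
seen = len(set(sample))

print(f"distinct in sample: {seen}")  # roughly 9,500 of 100,000
```

With ndistinct an order of magnitude below the row count, a 1% sample
sees only about a tenth of the distinct values, and with skewed data it
gets much worse.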

> Another idea would be to obtain the ndistinct values from an index somehow.
> Postgres cannot currently scan an index in physical order, only in logical
> order, due to locking considerations. But since we'd only be interested in
> an estimate, maybe a scan in physical block order would work for ndistinct
> estimates? Just a wild idea, mind you, I haven't checked at all if that'd
> be even remotely feasible.

I was thinking about that too, and I think we could do this using the
pageinspect contrib module. Sure, there might be a problem with bloated
indexes.

And relying on this actually means a multi-column index on all the
columns is required. Individual single-column indexes are not enough,
as we also need the number of distinct combinations.
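A trivial example (toy Python, hypothetical data) of why per-column
distinct counts tell us nothing about the number of distinct combinations:

```python
# Two column pairs with identical per-column distinct counts (10 and 10)
# but very different numbers of distinct combinations.
correlated  = [(i % 10, i % 10) for i in range(100)]   # b is always equal to a
independent = [(i % 10, i // 10) for i in range(100)]  # all 100 pairs occur

for name, rows in [("correlated", correlated), ("independent", independent)]:
    nd_a = len({a for a, b in rows})
    nd_b = len({b for a, b in rows})
    nd_ab = len(set(rows))
    print(name, nd_a, nd_b, nd_ab)
# correlated 10 10 10
# independent 10 10 100
```

ndistinct(a,b) can be anywhere between max(ndistinct(a), ndistinct(b))
and their product, so it has to be measured, not derived.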

regards
Tomas
