From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-17 22:12:16
Message-ID: 4D0BE040.5070604@fuzzy.cz
Lists: pgsql-hackers

On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables. As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table. So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better. Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) number
>> of distinct values for each of the columns as well as for the group of
>> columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values of the whole group of columns.
>> So when you want to get an estimate for queries on columns (a,b), you
>> need the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
>
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?

Well, not really - I haven't done any experiments with it. For two
columns, the selectivity equation is

(dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is the number of distinct values in
column X, and sel(X) is the selectivity of column X.
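
Just to make the arithmetic concrete, here's a minimal C sketch of that
equation (the function name and all the stats values are made up for
illustration, not taken from any real table):

    #include <stdio.h>

    /*
     * Estimated selectivity of a two-column condition, per the
     * equation above:
     *   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
     */
    static double
    estimate_selectivity(double dist_a, double sel_a,
                         double dist_b, double sel_b,
                         double dist_ab)
    {
        return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab);
    }

    int
    main(void)
    {
        /* hypothetical stats: dist(A)=100, dist(B)=50, dist(A,B)=1000 */
        printf("%g\n", estimate_selectivity(100, 0.01, 50, 0.02, 1000));
        return 0;
    }

which prints 0.001, i.e. (100*0.01 + 50*0.02) / (2*1000).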

dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------

So the estimate is proportional to dist(A) and dist(B), and inversely
proportional to dist(A,B). When you increase dist(A) and dist(B) 10x,
you get 10x the original estimate. Similarly, when you increase
dist(A,B) 10x, you get a 10x lower estimate.
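
As a quick check with the made-up numbers from the sketch above
(dist(A) = 100, sel(A) = 0.01, dist(B) = 50, sel(B) = 0.02,
dist(A,B) = 1000, giving 0.001): scaling dist(A) and dist(B) up to
1000 and 500 gives (1000*0.01 + 500*0.02) / 2000 = 0.01, i.e. 10x
higher, while scaling dist(A,B) up to 10000 instead gives
2 / 20000 = 0.0001, i.e. 10x lower.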

upper bound
-----------

Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

(sel(A) + sel(B)) / 2

because each term dist(X) * sel(X) / (2 * dist(A,B)) is then at most
sel(X) / 2.

I guess we can say that the case (sel(A) > sel(A,B)) can generally be
treated the same as

sel(A) = sel(A,B)

i.e. we can use the heuristic I've already offered in the PoC.
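
With the same made-up numbers as above, the bound works out to
(0.01 + 0.02) / 2 = 0.015, comfortably above the 0.001 estimate.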

lower bound
-----------

Not sure what the lower bound is, but the estimate can get close to 0
if dist(A) and dist(B) are very small and/or dist(A,B) is huge.
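
For example, with hypothetical uniform stats dist(A) = dist(B) = 1000,
sel(A) = sel(B) = 0.001 and dist(A,B) = 1000000 (i.e. every combination
present), the equation gives (1 + 1) / 2000000 = 0.000001, so the
estimate can get arbitrarily close to zero.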

regards
Tomas
