From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-17 22:12:16
Message-ID: 4D0BE040.5070604@fuzzy.cz
Lists: pgsql-hackers
On 17.12.2010 22:41, Heikki Linnakangas wrote:
> On 17.12.2010 23:13, Tomas Vondra wrote:
>>> On 17.12.2010 19:58, Robert Haas wrote:
>>> I haven't read the paper yet (sorry) but just off the top of my head,
>>> one possible problem here is that our n_distinct estimates aren't
>>> always very accurate, especially for large tables. As we've discussed
>>> before, making them accurate requires sampling a significant
>>> percentage of the table, whereas all of our other statistics can be
>>> computed reasonably accurately by sampling a fixed amount of an
>>> arbitrarily large table. So it's possible that relying more heavily
>>> on n_distinct could turn out worse overall even if the algorithm is
>>> better. Not sure if that's an issue here, just throwing it out
>>> there...
>>
>> Yes, you're right - the paper really is based on (estimates of) number
>> of distinct values for each of the columns as well as for the group of
>> columns.
>>
>> AFAIK it will work with reasonably precise estimates, but the point is
>> you need an estimate of distinct values of the whole group of columns.
>> So when you want to get an estimate for queries on columns (a,b), you
>> need the number of distinct value combinations of these two columns.
>>
>> And I think we're not collecting this right now, so this solution
>> requires scanning the table (or some part of it).
>
> Any idea how sensitive it is to the accuracy of that estimate on
> distinct value combinations? If we get that off by a factor of ten or a
> hundred, what kind of an effect does it have on the final cost estimates?
Well, not really - I haven't done any experiments with it. For two
columns, the selectivity equation is

   (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))

where A and B are columns, dist(X) is the number of distinct values in
column X, and sel(X) is the selectivity of column X.
dependency on dist(A), dist(B) and dist(A,B)
--------------------------------------------
So the estimate is proportional to dist(A) and dist(B), and inversely
proportional to dist(A,B). When you increase dist(A) and dist(B) 10x,
you get 10x the original estimate. Similarly, increasing dist(A,B) 10x
gives you a 10x lower estimate.
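That scaling behaviour can be sketched in a few lines of Python (the
function name and the dist/sel values below are made up purely for
illustration, not taken from the PoC patch):

```python
def estimate(dist_a, sel_a, dist_b, sel_b, dist_ab):
    # (dist(A)*sel(A) + dist(B)*sel(B)) / (2 * dist(A,B))
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

base = estimate(100, 0.01, 50, 0.02, 2000)   # = (1 + 1) / 4000 = 0.0005

# 10x both dist(A) and dist(B) -> 10x the original estimate
print(estimate(1000, 0.01, 500, 0.02, 2000) / base)   # 10.0

# 10x dist(A,B) -> 10x lower estimate
print(base / estimate(100, 0.01, 50, 0.02, 20000))    # 10.0
```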
upper bound
-----------
Unless dist(A) or dist(B) is greater than dist(A,B), the estimated
selectivity is bounded by

   (sel(A) + sel(B)) / 2

I guess we can say that (sel(A) > sel(A,B)) is generally the same as

   sel(A) = sel(A,B)

i.e. we can use the heuristic I've already offered in the PoC.
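A quick numeric check of that bound (hypothetical values; the only
constraint assumed is that dist(A) and dist(B) do not exceed
dist(A,B)):

```python
def estimate(dist_a, sel_a, dist_b, sel_b, dist_ab):
    # (dist(A)*sel(A) + dist(B)*sel(B)) / (2 * dist(A,B))
    return (dist_a * sel_a + dist_b * sel_b) / (2.0 * dist_ab)

sel_a, sel_b = 0.01, 0.02
bound = (sel_a + sel_b) / 2.0            # 0.015

# With dist(A)=100 and dist(B)=50 both <= dist(A,B), the estimate
# never exceeds (sel(A) + sel(B)) / 2:
for dist_ab in (100, 500, 10000):
    assert estimate(100, sel_a, 50, sel_b, dist_ab) <= bound
```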
lower bound
-----------
Not sure what the lower bound is, but it may approach 0 if dist(A) and
dist(B) are very small and/or dist(A,B) is huge.
regards
Tomas