From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Group-count estimation statistics
Date: 2005-01-28 18:29:10
Message-ID: 87u0p1tnk9.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows estimated for
> the whole table". It turns out that customer_id and item_id are
> pretty highly correlated, enough that the actual number of distinct
> groups is barely a hundredth of what you get from the product rule.
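
(For reference, that rule boils down to something like the following
sketch -- the names are mine, not the planner's:)

    # Clamped product rule as described above: multiply the per-column
    # distinct-value estimates together, then clamp to the estimated
    # row count for the whole table.
    def estimate_groups(ndistinct, nrows):
        est = 1.0
        for d in ndistinct:
            est *= d
        return min(est, nrows)

    estimate_groups([1000, 1000], 10000000)  # -> 1,000,000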

So why is it any more reasonable for Postgres to assume zero correlation
than any other value? Perhaps Postgres should calculate these cases
assuming some arbitrary level of correlation.

For example, pulling an algorithm out of nowhere, it could take the
distinct-value count of the most selective column, then multiply not by
the full count of the next most selective column but by its square root,
then by the cube root of the third column's count, the fourth root of the
fourth's, and so on.

So for 10M records grouped over three columns, each with 1,000 distinct
values, you would get 1,000 * 1,000^(1/2) * 1,000^(1/3), or about 316,228
distinct groups -- which actually seems like not a bad guess.
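
In code, the idea would be something like this (a sketch of my made-up
heuristic, nothing more):

    # "Successive roots" heuristic -- invented above, not anything
    # Postgres does.  Sort the per-column distinct counts descending so
    # the largest contributes in full, the next its square root, the
    # third its cube root, and so on; clamp to the row count as before.
    def estimate_groups_roots(ndistinct, nrows):
        est = 1.0
        for i, d in enumerate(sorted(ndistinct, reverse=True), 1):
            est *= d ** (1.0 / i)
        return min(est, nrows)

    estimate_groups_roots([1000, 1000, 1000], 10000000)
    # -> 1000 * 1000**(1/2) * 1000**(1/3) ~ 316,228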

To be honest, I pulled the "successive roots" thing out of nowhere. I
suspect there's a more rigorous statistical approach that would have some
actual motivation behind it. For a given assumed correlation and
distribution there ought to be a way to calculate the expected number of
distinct combinations. Then, once we have some mechanism for gathering
real correlation statistics, we could just plug those in in place of the
arbitrarily assumed value.

Actually the formula would be quite complex. As the total number of
records goes up, the expected number of distinct combinations should
approach the total number of records, even if the number of distinct
values in each column doesn't change (at least until it is capped by the
number of possible combinations).
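
Under the simplest zero-correlation model -- n rows drawn uniformly and
independently from D equally likely combinations -- the classic occupancy
result gives the expected distinct count, and it shows exactly that
behaviour; this is only illustration, not a proposal:

    # Expected number of distinct combinations among nrows uniform,
    # independent draws from ncombos possible combinations.
    def expected_distinct(ncombos, nrows):
        return ncombos * (1.0 - (1.0 - 1.0 / ncombos) ** nrows)

    expected_distinct(10**9, 10**6)  # ~999,500: nearly one group per row
    expected_distinct(10**6, 10**9)  # ~1,000,000: capped by combinations

A real correlation term would deform that curve, and working out how is
the part that needs the rigorous treatment.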

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

There's another possible solution: if Postgres kept statistics on the
actual results of queries, it could later use that feedback to come up
with better guesses, even if it doesn't know *why* they're better.
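
Just to make the idea concrete, something along these lines -- entirely
hypothetical, no such mechanism exists in Postgres:

    # Hypothetical feedback loop: track the ratio of actual to estimated
    # group counts per set of grouping columns, and scale future
    # estimates by a smoothed version of that ratio.
    correction = {}  # frozenset of column names -> smoothed ratio

    def record_feedback(columns, estimated, actual):
        key = frozenset(columns)
        ratio = actual / float(estimated)
        prev = correction.get(key, ratio)
        correction[key] = 0.9 * prev + 0.1 * ratio

    def corrected_estimate(columns, estimated):
        return estimated * correction.get(frozenset(columns), 1.0)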

--
greg
