Re: improving GROUP BY estimation

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Mark Dilger <hornschnorter(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: improving GROUP BY estimation
Date: 2016-03-03 19:05:09
Message-ID: CAPpHfdvRh9C-eifpeCD_eFwBn3do5CEyp9mEVon7HkEbDNAgPg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On 03/03/2016 12:53 PM, Alexander Korotkov wrote:
>
>> I've been assigned to review this patch.
>>
>> I've checked version estimate-num-groups-v2.txt by Mark Dilger.
>> It applies to head cleanly and passes the corrected regression tests.
>>
>> Regarding correlated/uncorrelated cases: I think the optimizer currently
>> assumes mostly that all distributions are independent. We should follow
>> this assumption here as well until we have a fundamentally better option
>> (like your multivariate statistics).
>>
>> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
>>  	reldistinct = clamp;
>>
>>  	/*
>> -	 * Multiply by restriction selectivity.
>> +	 * Estimate number of distinct values expected in given number of rows.
>>  	 */
>> -	reldistinct *= rel->rows / rel->tuples;
>> +	reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
>>
>>  	/*
>>  	 * Update estimate of total distinct groups.
>>
>> I think we need way more explanation in comments here (possibly with
>> external links). For now, it looks like a formula that appears out of
>> nowhere.
>>
>
> I thought the details (particularly the link to stackexchange, discussing
> the formula) would be enough, but let me elaborate.
>
> The current formula
>
> reldistinct *= rel->rows / rel->tuples;
>
> simply multiplies the ndistinct estimate by the selectivity. So when you
> select 1% of the table, the estimate says we'll see 1% of the ndistinct
> values. But clearly that's naive, because for example when you have 10k
> distinct values and you select 10M rows, you should expect to see nearly
> all of them in the sample.
>
> And that's what the formula does - it gives you the expected number of
> distinct values when selecting 'k' values from 'd' distinct values with
> replacement:
>
> count(k, d) = d * (1 - ((d-1)/d) ^ k)
>
> It's assuming the distinct values are equally probable (uniform
> distribution) and that the probabilities do not change (that's the "with
> replacement").
>
> I guess we could relax those assumptions, for example by using the MCV
> statistics to further improve the estimate, and also relax the 'with
> replacement' assumption, but that'd make the formula way more complicated.
>
> [1]
> http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers

Your explanation in the first mail was good enough. Now it's even better.
But what I actually meant is that we need to include some brief explanation
in the source code (either comments or a README). It would be good if
someone who wants to understand this could do so without searching the
mailing list archives or git history.
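
For example, the gist of the formula could be conveyed with a brief
standalone sketch like this (illustrative names only, not the actual
selfuncs.c code):

    #include <math.h>
    #include <stdio.h>

    /*
     * Expected number of distinct values seen when drawing 'rows' tuples
     * with replacement from 'ndistinct' equally probable values:
     *
     *     d * (1 - ((d - 1) / d) ^ k)
     */
    static double
    expected_distinct(double ndistinct, double rows)
    {
        if (ndistinct <= 1.0)
            return ndistinct;
        return ndistinct * (1.0 - pow((ndistinct - 1.0) / ndistinct, rows));
    }

    int
    main(void)
    {
        printf("%f\n", expected_distinct(3, 2));            /* 1.666667 (= 5/3) */
        printf("%f\n", expected_distinct(10000, 10000000)); /* ~10000 */
        return 0;
    }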

>> Also, note that rel->tuples is gone from the formula parameters. So we
>> actually don't care how many tuples are in the table. This is because the
>> formula assumes that the same tuple can be selected multiple times. For
>> low numbers this can lead to some errors. Consider selecting 2 from 3
>> distinct tuples. If you can't select the same tuple twice, then you
>> always select 2 distinct tuples. But according to the formula you will
>> select 5/3 on average (3 * (1 - (2/3)^2) = 5/3). I think this error for
>> small numbers is negligible, especially because getting rid of it would
>> require way more computation. But it's worth mentioning in the comments.
>>
>
> Well, yeah. That's the consequence of the 'with replacement' assumption.
>
> I guess we could handle this somehow, though. For example we can compute
> the minimum possible number of distinct values like this:
>
> average_rows_per_value = (tuples / ndistinct);
> min_estimate = ceil(rows / average_rows_per_value);
>
> and use that as a minimum for the estimate. In the example you've given,
> this would give
>
> average_rows_per_value = (3 / 3) = 1;
> min_estimate = ceil(2 / 1) = 2;
>
> So it serves as a correction for the 'with replacement' assumption. With
> only 2 distinct values we'd get this:
>
> average_rows_per_value = (3 / 2) = 1.5;
> min_estimate = ceil(2 / 1.5) = 2;
>
> Of course, it's just a heuristic, so it may fail in some other cases.

I'm not sure we actually need it. Even in the worst case the error doesn't
seem to be very big. But feel free to add this heuristic; it looks OK to me.
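
For what it's worth, a sketch of the formula combined with that lower
bound might look like this (variable names are illustrative, not the
actual patch):

    #include <math.h>

    /*
     * With-replacement estimate clamped to the minimum number of distinct
     * values that 'rows' tuples must contain, assuming each value occupies
     * (tuples / ndistinct) rows on average.
     */
    static double
    estimate_distinct_clamped(double ndistinct, double tuples, double rows)
    {
        double  estimate = ndistinct *
            (1.0 - pow((ndistinct - 1.0) / ndistinct, rows));
        double  avg_rows_per_value = tuples / ndistinct;
        double  min_estimate = ceil(rows / avg_rows_per_value);

        /* correct the with-replacement undershoot for small numbers */
        return (estimate < min_estimate) ? min_estimate : estimate;
    }

With ndistinct = 3, tuples = 3, rows = 2 this returns 2 instead of 5/3,
matching the example above.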

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
