Re: cost_sort() improvements

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Subject: Re: cost_sort() improvements
Date: 2018-07-22 20:13:39
Message-ID: b80be535-273f-b026-575b-c6fdc4f33e00@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/12/2018 05:00 PM, Teodor Sigaev wrote:
>
>> One more thought about estimating the worst case - I wonder if simply
>> multiplying the per-tuple cost by 1.5 is the right approach. It does not
>> seem particularly principled, and it's trivial simple to construct
>> counter-examples defeating it (imagine columns with 99% of the rows
>> having the same value, and then many values in exactly one row - that
>> will defeat the ndistinct-based group size estimates).
>
> Agree. But right now that is best what we have. and constant 1.5 easily
> could be changed to 1 or 10 - it is just arbitrary value, intuitively
> chosen.  As I mentioned in v7 patch description, new estimation function
> provides ~10% bigger estimation and this constant should not be very
> large, because we could easily get overestimation.
>
>>
>> The reason why constructing such counter-examples is so simple is that
>> this relies entirely on the ndistinct stats, ignoring the other stats we
>> already have. I think this might leverage the per-column MCV lists (and
>> eventually multi-column MCV lists, if that ever gets committed).
>>
>> We're estimating the number of tuples in group (after fixing values in
>> the preceding columns), because that's what determines the number of
>> comparisons we need to make at a given stage. The patch does this by
>> estimating the average group size, by estimating ndistinct of the
>> preceding columns G(i-1), and computing ntuples/G(i-1).
>>
>> But consider that we also have MCV lists for each column - if there is a
>> very common value, it should be there. So let's say M(i) is a frequency
>> of the most common value in i-th column, determined either from the MCV
>> list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i)
>> for the fist i columns, then the largest possible group of values when
>> comparing values in i-th column is
>>
>>      N * m(i-1)
>>
>> This seems like a better way to estimate the worst case. I don't know if
>> this should be used instead of NG(i), or if those two estimates should
>> be combined in some way.
> Agree, definitely we need an estimation of larger group size to use it
> in cost_sort. But I don't feel power to do that, pls, could you attack a
> computing group size issue? Thank you.
>

Attached is a simple patch illustrating this MCV-based approach. I don't
claim it's finished / RFC, but hopefully it's sufficient to show what I
meant. I'm not sure how to plug it into the v8 of your patch at this
point, so I've simply added an elog to estimate_num_groups to also print
the estimate or largest group for GROUP BY queries.

It's largely a simplified copy of estimate_num_groups() and there's a
couple of comments of how it might be improved further.

>>
>> I think estimating the largest group we need to sort should be helpful
>> for the incremental sort patch, so I'm adding Alexander to this
>> thread.Agree
> I think so. suggested estimation algorithm should be modified to support
> leading preordered keys and this form could be directly used in GROUP BY
> and incremental sort patches.

Right. What I think the function estimating the group size could do in
case of incremental sort is producing two values - maximum size of the
leading group, and maximum group size overall. The first one would be
useful for startup cost, the second one for total cost.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
estimate-largest-group.patch text/x-patch 10.1 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-07-22 20:22:10 Re: cost_sort() improvements
Previous Message Fabien COELHO 2018-07-22 20:13:20 Re: pgbench: improve --help and --version parsing