|From:||Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>|
|To:||Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>|
|Subject:||Re: cost_sort() improvements|
|Views:||Raw Message | Whole Thread | Download mbox|
On 07/12/2018 05:00 PM, Teodor Sigaev wrote:
>> One more thought about estimating the worst case - I wonder if simply
>> multiplying the per-tuple cost by 1.5 is the right approach. It does not
>> seem particularly principled, and it's trivial simple to construct
>> counter-examples defeating it (imagine columns with 99% of the rows
>> having the same value, and then many values in exactly one row - that
>> will defeat the ndistinct-based group size estimates).
> Agree. But right now that is best what we have. and constant 1.5 easily
> could be changed to 1 or 10 - it is just arbitrary value, intuitively
> chosen. As I mentioned in v7 patch description, new estimation function
> provides ~10% bigger estimation and this constant should not be very
> large, because we could easily get overestimation.
>> The reason why constructing such counter-examples is so simple is that
>> this relies entirely on the ndistinct stats, ignoring the other stats we
>> already have. I think this might leverage the per-column MCV lists (and
>> eventually multi-column MCV lists, if that ever gets committed).
>> We're estimating the number of tuples in group (after fixing values in
>> the preceding columns), because that's what determines the number of
>> comparisons we need to make at a given stage. The patch does this by
>> estimating the average group size, by estimating ndistinct of the
>> preceding columns G(i-1), and computing ntuples/G(i-1).
>> But consider that we also have MCV lists for each column - if there is a
>> very common value, it should be there. So let's say M(i) is a frequency
>> of the most common value in i-th column, determined either from the MCV
>> list or as 1/ndistinct for that column. Then if m(i) is minimum of M(i)
>> for the fist i columns, then the largest possible group of values when
>> comparing values in i-th column is
>> N * m(i-1)
>> This seems like a better way to estimate the worst case. I don't know if
>> this should be used instead of NG(i), or if those two estimates should
>> be combined in some way.
> Agree, definitely we need an estimation of larger group size to use it
> in cost_sort. But I don't feel power to do that, pls, could you attack a
> computing group size issue? Thank you.
Attached is a simple patch illustrating this MCV-based approach. I don't
claim it's finished / RFC, but hopefully it's sufficient to show what I
meant. I'm not sure how to plug it into the v8 of your patch at this
point, so I've simply added an elog to estimate_num_groups to also print
the estimate or largest group for GROUP BY queries.
It's largely a simplified copy of estimate_num_groups() and there's a
couple of comments of how it might be improved further.
>> I think estimating the largest group we need to sort should be helpful
>> for the incremental sort patch, so I'm adding Alexander to this
> I think so. suggested estimation algorithm should be modified to support
> leading preordered keys and this form could be directly used in GROUP BY
> and incremental sort patches.
Right. What I think the function estimating the group size could do in
case of incremental sort is producing two values - maximum size of the
leading group, and maximum group size overall. The first one would be
useful for startup cost, the second one for total cost.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
|Next Message||Tomas Vondra||2018-07-22 20:22:10||Re: cost_sort() improvements|
|Previous Message||Fabien COELHO||2018-07-22 20:13:20||Re: pgbench: improve --help and --version parsing|