Re: POC: GROUP BY optimization

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2019-05-03 20:28:21
Message-ID: CA+q6zcVRrd-z4YZ4M43ccst7aGL9==w5r1fionRWhP9ot6mybQ@mail.gmail.com
Lists: pgsql-hackers

> On Tue, Apr 9, 2019 at 5:21 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> So I personally would suggest to treat those patches as independent until
> the very last moment, develop the costing improvements needed by each
> of them, and then decide which of them are committable / in what order.

I had the same idea, but judging from the questions raised in this thread, it's
quite hard to go with reordering based only on the frequency of values. I hoped
that the cost_sort improvement patch would be simple enough to incorporate
here, but of course it wasn't. Working from the assumption that the amount of
work required for sorting depends only on the number of distinct groups and on
how costly it is to compare values of the data type, I've ended up extracting
the get_width_multiplier and get_func_cost parts from the cost_sort patch and
including them in 0003-Reorder-by-values-distribution. This makes it possible
to take into account situations where we compare e.g. long strings or a custom
data type with a high procost for comparison (though I've used these values
directly, without any adjusting coefficients yet).

> On Wed, Jun 13, 2018 at 6:41 PM Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
>
> > So that's a nice improvement, although I think we should also consider
> > non-uniform distributions (using the per-column MCV lists).
>
> Could you clarify how to do that?

Since I'm not familiar with this topic, I would like to ask the same question:
how would that work, and what are the advantages?

> On Sat, Jun 16, 2018 at 5:59 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> I still think we need to be careful when introducing new optimizations
> in this area - reordering the grouping keys by ndistinct, ORDER BY or
> whatever. In particular I don't think we should commit these patches
> that may quite easily cause regressions, and then hope some hypothetical
> future patch fixes the costing.

I'm a bit concerned about this part of the discussion. There is an idea running
through the whole thread about avoiding the situation where a user knows which
order is better and we generate a different one by mistake. From what I see
right now, even if all the objections were addressed, there is a chance that
some coefficients will not be good enough (e.g. the width multiplier is based
on an average width, or it may turn out that all the compared strings differ
at the very beginning) and the chosen order will not be optimal. Does that mean
that any implementation of such an optimization should in any case provide a
way to override it?

Attachment Content-Type Size
v12-0001-Add-tests-for-group-by-optimization.patch application/octet-stream 9.0 KB
v12-0002-Reorder-to-match-ORDER-BY-or-index.patch application/octet-stream 15.6 KB
v12-0003-Reorder-by-values-distribution.patch application/octet-stream 13.5 KB
