Quick Links

Re: POC: GROUP BY optimization

From:	Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>
To:	Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject:	Re: POC: GROUP BY optimization
Date:	2021-07-13 11:37:01
Message-ID:	CALtqXTe4rKXHXxNNGVy0wHVwGYYJpy2QNPTN29qq+ErBJJakrA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Mar 10, 2021 at 4:05 AM Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
wrote:

> Hi,
>
> I take a look at the patch today. Attached is the v12 and a separate
> patch with some comment tweaks and review comments.
>
>
> 1) I see the patch results in some plan changes in postgres_fdw. I
> assume it's somehow related to the sort costing changes, but I find it a
> bit suspicious. Why should the plan change from this:
>
> Foreign Scan
> Remote SQL: ... ORDER BY ... OFFSET 100 LIMIT 10;
>
> to
>
> Limit
> Foreign Scan
> Remote SQL: ... ORDER BY ...
>
> But even if this is due to changes to order by costing, postponing the
> limit seems like a net loss - the sort happens before the limit, and by
> postponing the limit after foreign scan we need to transfer 100 more
> rows. So what's causing this?
>
> The difference in plan cost seems fairly substantial (old: 196.52, new:
> 241.94). But maybe that's OK and expected.
>
>
> 2) I wonder how much more expensive (in terms of CPU) is the new sort
> costing code. It's iterative, and it's calling functions to calculate
> number of groups etc. I wonder what's the impact simple queries?
>
>
> 3) It's not clear to me why we need "fake var" protection? Why would the
> estimate_num_groups() fail for that?
>
>
> 4) In one of the places a comment says DEFAULT_NUM_DISTINCT is not used
> because we're dealing with multiple columns. That makes sense, but maybe
> we should use DEFAULT_NUM_DISTINCT at least to limit the range. That is,
> with N columns we should restrict the nGroups estimate by:
>
> min = DEFAULT_NUM_DISTINCT
> max = Min(pow(DEFAULT_NUM_DISTINCT, ncolumns), tuples)
>
> Also, it's not clear what's the logic behind the formula:
>
> nGroups = ceil(2.0 + sqrt(tuples) *
> list_length(pathkeyExprs) / list_length(pathkeys));
>
> A comment explaining that would be helpful.
>
>
> 5) The compute_cpu_sort_cost comment includes two references (in a quite
> mixed-up way), and then there's another reference in the code. I suggest
> we list all of them in the function comment.
>
>
> 6) There's a bunch of magic values (various multipliers etc.). It's not
> clear which values are meant to be equal, etc. Let's introduce nicer
> constants with reasonable names.
>
>
> 7) The comment at get_cheapest_group_keys_order is a bit misleading,
> because it talks about "uniqueness" - but that's not what we're looking
> at here (I mean, is the column unique or not). We're looking at number
> of distinct values in the column, which is a different thing. Also, it'd
> be good to roughly explain what get_cheapest_group_keys_order does - how
> it calculates the order (permutations up to 4 values, etc.).
>
>
> 8) The comment at compute_cpu_sort_cost should also explain basics of
> the algorithm. I tried doing something like that, but I'm not sure I got
> all the details right and it probably needs further improvements.
>
>
> 9) The main concern I have is still about the changes in planner.c, and
> I think I've already shared it before. The code takes the grouping
> pathkeys, and just reshuffles them to what it believes is a better order
> (cheaper, ...). That is, there's on input pathkey list, and it gets
> transformed into another pathkey list. The problem is that even if this
> optimizes the sort, it may work against some optimization in the upper
> part of the plan.
>
> Imagine a query that does something like this:
>
> SELECT a, b, count(*) FROM (
> SELECT DISTINCT a, b, c, d FROM x
> ) GROUP BY a, b;
>
> Now, from the costing perspective, it may be cheaper to do the inner
> grouping by "c, d, a, b" for example. But that works directly against
> the second grouping, which would benefit from having the input sorted by
> "a, b". How exactly would this behaves depends on the number of distinct
> values in the columns, how expensive the comparisons are for each
> column, and so on. But you get the idea.
>
> I admit I haven't managed to construct a reasonably query that'd be
> significantly harmed by this, but I haven't spent a lot of time on it.
>
> I'm pretty sure I used this trick in the past, when doing aggregations
> on large data sets, where it was much better to "pipe" the data through
> multiple GroupAggregate nodes.
>
> For this reason I think the code generating paths should work a bit like
> get_useful_pathkeys_for_relation and generate_useful_gather_paths.
>
> * group_keys_reorder_by_pathkeys should not produce just one list of
> pathkeys, but multiple interesting options. At the moment it should
> produce at least the input grouping pathkeys (as specified in the
> query), and the "optimal" pathkeys (by lowest cost).
>
> * the places calling group_keys_reorder_by_pathkeys should loop on the
> result, and generate separate path for each option.
>
> I'd guess in the future we'll "peek forward" in the plan and see if
> there are other interesting pathkeys (that's the expectation for
> get_useful_pathkeys_for_relation).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

The patch does not apply successfully, can you please take a look at that

http://cfbot.cputube.org/patch_33_1651.log

=== applying patch ./0001-v12-20210309.patch

atching file src/include/utils/selfuncs.h
Hunk #1 FAILED at 198.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/utils/selfuncs.h.rej

Based on @Tomas Vondra comments, and patch does not apply, I am changing
the status to "Waiting On Author".

Ibrar Ahmed

In response to

Re: POC: GROUP BY optimization at 2021-03-09 23:05:55 from Tomas Vondra

Responses

Re: POC: GROUP BY optimization at 2021-07-20 13:12:38 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Ranier Vilela	2021-07-13 11:45:20	Re: Add proper planner support for ORDER BY / DISTINCT aggregates
Previous Message	Ranier Vilela	2021-07-13 11:24:55	Re: Transactions involving multiple postgres foreign servers, take 2