Re: POC: GROUP BY optimization

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, David Rowley <dgrowleyml(at)gmail(dot)com>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: POC: GROUP BY optimization
Date: 2024-01-16 15:05:33
Message-ID: CAPpHfdsfHn89cAFEa6-6Xm1sB_xf8Rf2WNab6DQzb8Rvs0yD8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

Thank you for your review. The revised patchset is attached.

On Tue, Jan 16, 2024 at 4:48 AM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> On Mon, Jan 15, 2024 at 3:56 PM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>>
>> On Mon, Jan 15, 2024 at 8:42 AM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>> > On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> >>
>> >> Thank you for providing the test case relevant for this code change.
>> >> The revised patch incorporating this change is attached. Now the
>> >> patchset looks good to me. I'm going to push it if there are no
>> >> objections.
>> >
>> > Seems I'm late for the party. Can we hold for several more days? I'd
>> > like to have a review on this patch.
>>
>> Sure, NP. I'll hold it till your review.
>
>
> Thanks. Appreciate that.
>
> I have briefly reviewed this patch and here are some comments.
>
> * When trying to match the ordering of GROUP BY to that of ORDER BY in
> get_useful_group_keys_orderings, this patch checks against the length of
> path->pathkeys. This does not make sense. I guess this is a typo and
> the real intention is to check against root->sort_pathkeys.
>
> --- a/src/backend/optimizer/path/pathkeys.c
> +++ b/src/backend/optimizer/path/pathkeys.c
> @@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
> root->num_groupby_pathkeys);
>
> if (n > 0 &&
> - (enable_incremental_sort || n == list_length(path->pathkeys)))
> + (enable_incremental_sort || n == list_length(root->sort_pathkeys)))
>
> However, I think this is also incorrect. When incremental sort is
> disabled, it is reasonable to consider reordering the GROUP BY keys only
> if the number of matching pathkeys is equal to the length of
> root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.

Hmm... Why should this be list_length(root->group_pathkeys) while
we're targeting root->sort_pathkeys. I yet changed that to
list_length(root->sort_pathkeys).

> * When the original ordering of GROUP BY keys matches the path's
> pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
> generate duplicate PathKeyInfos. Consequently, this duplication would
> lead to the creation and addition of identical paths multiple times.
> This is not great. Consider the query below:
>
> create table t (a int, b int);
> create index on t (a, b);
> set enable_hashagg to off;
>
> explain select count(*) from t group by a, b;
> QUERY PLAN
> ----------------------------------------------------------------------------------
> GroupAggregate (cost=0.15..97.27 rows=226 width=16)
> Group Key: a, b
> -> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8)
> (3 rows)
>
> The same path with group keys {a, b} is created and added twice.

I tried to address that.

> * Part of the work performed in this patch overlaps with that of
> preprocess_groupclause. They are both trying to adjust the ordering of
> the GROUP BY keys to match ORDER BY. I wonder if it would be better to
> perform this work only once.

Andrei, could you take a look.

> * When reordering the GROUP BY keys, I think the following checks need
> some improvements.
>
> + /*
> + * Since 1349d27 pathkey coming from underlying node can be in the
> + * root->group_pathkeys but not in the processed_groupClause. So, we
> + * should be careful here.
> + */
> + sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
> + *group_clauses);
> + if (!sgc)
> + /* The grouping clause does not cover this pathkey */
> + break;
>
> I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
> valid before calling get_sortgroupref_clause_noerr with it. Also, how
> can we guarantee that the returned GROUP BY item is sortable? Should we
> check that with OidIsValid(sgc->sortop)?

Added.

------
Regards,
Alexander Korotkov

Attachment Content-Type Size
0001-Generalize-common-code-of-adding-sort-befor-20240116.patch application/octet-stream 9.1 KB
0002-Explore-alternative-orderings-of-group-by-p-20240116.patch application/octet-stream 36.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2024-01-16 15:08:45 Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Previous Message Ashutosh Bapat 2024-01-16 14:49:14 Re: Report planning memory in EXPLAIN ANALYZE