Quick Links

Re: POC: GROUP BY optimization

From:	Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Richard Guo <guofenglinux(at)gmail(dot)com>
Cc:	Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, David Rowley <dgrowleyml(at)gmail(dot)com>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: POC: GROUP BY optimization
Date:	2024-01-18 09:48:54
Message-ID:	f7a8729d-86d4-43ad-8b68-a7a3ace9d0fc@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 16/1/2024 22:05, Alexander Korotkov wrote:
> On Tue, Jan 16, 2024 at 4:48 AM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>> * When trying to match the ordering of GROUP BY to that of ORDER BY in
>> get_useful_group_keys_orderings, this patch checks against the length of
>> path->pathkeys. This does not make sense. I guess this is a typo and
>> the real intention is to check against root->sort_pathkeys.
Thanks! It is really my blunder - fresh look works.
>>
>> --- a/src/backend/optimizer/path/pathkeys.c
>> +++ b/src/backend/optimizer/path/pathkeys.c
>> @@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
>> root->num_groupby_pathkeys);
>>
>> if (n > 0 &&
>> - (enable_incremental_sort || n == list_length(path->pathkeys)))
>> + (enable_incremental_sort || n == list_length(root->sort_pathkeys)))
>>
>> However, I think this is also incorrect. When incremental sort is
>> disabled, it is reasonable to consider reordering the GROUP BY keys only
>> if the number of matching pathkeys is equal to the length of
>> root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.
>
> Hmm... Why should this be list_length(root->group_pathkeys) while
> we're targeting root->sort_pathkeys. I yet changed that to
> list_length(root->sort_pathkeys).
I think, in the first case, when we are trying to arrange group-by keys
according to underlying pathkeys with incremental sort = off, it makes
sense to do if we fetch all group-by keys regardless of a more or equal
number of path keys the incoming path contains. The code and test case
are in step1.txt.
>
>> * When the original ordering of GROUP BY keys matches the path's
>> pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
>> generate duplicate PathKeyInfos. Consequently, this duplication would
>> lead to the creation and addition of identical paths multiple times.
>> This is not great. Consider the query below:
>>
>> create table t (a int, b int);
>> create index on t (a, b);
>> set enable_hashagg to off;
>>
>> explain select count(*) from t group by a, b;
>> QUERY PLAN
>> ----------------------------------------------------------------------------------
>> GroupAggregate (cost=0.15..97.27 rows=226 width=16)
>> Group Key: a, b
>> -> Index Only Scan using t_a_b_idx on t (cost=0.15..78.06 rows=2260 width=8)
>> (3 rows)
>>
>> The same path with group keys {a, b} is created and added twice.
>
> I tried to address that.
>
>> * Part of the work performed in this patch overlaps with that of
>> preprocess_groupclause. They are both trying to adjust the ordering of
>> the GROUP BY keys to match ORDER BY. I wonder if it would be better to
>> perform this work only once.
>
> Andrei, could you take a look.
As I see, the PathKeyInfo list often contains duplicated pathkeys,
coming from either sort_pathkeys or path->pathkeys orderings. So, I can
propose to check duplicates each time (see step2.txt in attachment).
>
>> * When reordering the GROUP BY keys, I think the following checks need
>> some improvements.
>>
>> + /*
>> + * Since 1349d27 pathkey coming from underlying node can be in the
>> + * root->group_pathkeys but not in the processed_groupClause. So, we
>> + * should be careful here.
>> + */
>> + sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
>> + *group_clauses);
>> + if (!sgc)
>> + /* The grouping clause does not cover this pathkey */
>> + break;
>>
>> I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
>> valid before calling get_sortgroupref_clause_noerr with it. Also, how
>> can we guarantee that the returned GROUP BY item is sortable? Should we
>> check that with OidIsValid(sgc->sortop)?
>
> Added.
Reviewed it, looks good.

--
regards,
Andrei Lepikhov
Postgres Professional

Attachment	Content-Type	Size
step1.txt	text/plain	2.9 KB

In response to

Re: POC: GROUP BY optimization at 2024-01-16 15:05:33 from Alexander Korotkov

Responses

Re: POC: GROUP BY optimization at 2024-01-18 14:18:34 from Andrei Lepikhov
Re: POC: GROUP BY optimization at 2024-01-19 14:21:48 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Matthias van de Meent	2024-01-18 09:49:48	Re: cleanup patches for incremental backup
Previous Message	Amit Kapila	2024-01-18 09:43:56	Re: speed up a logical replica setup