Quick Links

Re: POC: GROUP BY optimization

From:	"Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: POC: GROUP BY optimization
Date:	2022-01-20 05:05:57
Message-ID:	dc933c66-41b9-8907-327e-f6361a690654@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

I keep work on this patch. Here is intermediate results.

On 7/22/21 3:58 AM, Tomas Vondra wrote:
> in the first loop. Which seems pretty bogus - why would there be just
> two groups? When processing the first expression, it's as if there was
> one big "prev group" with all the tuples, so why not to just use nGroups
> as it is?

I think, heapsort code seems very strange. Look into fallback case. It
based on an output_tuples value. Maybe we should use nGroups value here,
but based on a number of output_tuples?

> 1) I looked at the resources mentioned as sources the formulas came
> from, but I've been unable to really match the algorithm to them. The
> quicksort paper is particularly "dense", the notation seems to be very
> different, and none of the theorems seem like an obvious fit. Would be
> good to make the relationship clearer in comments etc.

Fixed (See attachment).

> 3) I'm getting a bit skeptical about the various magic coefficients that
> are meant to model higher costs with non-uniform distribution. But
> consider that we do this, for example:
>
> tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups);
>
> but then in the next loop we call estimate_num_groups_incremental and
> pass this "tweaked" tuplesPerPrevGroup value to it. I'm pretty sure this
> may have various strange consequences - we'll calculate the nGroups
> based on the inflated value, and we'll calculate tuplesPerPrevGroup from
> that again - that seems susceptible to "amplification".
>
> We inflate tuplesPerPrevGroup by 50%, which means we'll get a higher
> nGroups estimate in the next loop - but not linearly. An then we'll
> calculate the inflated tuplesPerPrevGroup and estimated nGroup ...

Weighting coefficient '1.5' shows our desire to minimize the number of
comparison operations on each next attribute of a pathkeys list.
Increasing this coef we increase a chance, that planner will order
pathkeys by decreasing of uniqueness.
I think, it's ok.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachment	Content-Type	Size
fixes.txt	text/plain	7.3 KB

In response to

Re: POC: GROUP BY optimization at 2021-07-21 22:58:44 from Tomas Vondra

Responses

Re: POC: GROUP BY optimization at 2022-01-21 11:09:01 from Andrey Lepikhov
Re: POC: GROUP BY optimization at 2022-02-01 09:48:11 from Andrey V. Lepikhov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2022-01-20 05:32:43	Re: Refactoring of compression options in pg_basebackup
Previous Message	Michael Paquier	2022-01-20 04:38:45	Re: pg_upgrade should truncate/remove its logs before running