Re: POC: GROUP BY optimization

From: "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: POC: GROUP BY optimization
Date: 2022-01-20 05:05:57
Message-ID: dc933c66-41b9-8907-327e-f6361a690654@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I keep work on this patch. Here is intermediate results.

On 7/22/21 3:58 AM, Tomas Vondra wrote:
> in the first loop. Which seems pretty bogus - why would there be just
> two groups? When processing the first expression, it's as if there was
> one big "prev group" with all the tuples, so why not to just use nGroups
> as it is?

I think, heapsort code seems very strange. Look into fallback case. It
based on an output_tuples value. Maybe we should use nGroups value here,
but based on a number of output_tuples?

> 1) I looked at the resources mentioned as sources the formulas came
> from, but I've been unable to really match the algorithm to them. The
> quicksort paper is particularly "dense", the notation seems to be very
> different, and none of the theorems seem like an obvious fit. Would be
> good to make the relationship clearer in comments etc.

Fixed (See attachment).

> 3) I'm getting a bit skeptical about the various magic coefficients that
> are meant to model higher costs with non-uniform distribution. But
> consider that we do this, for example:
>
> tuplesPerPrevGroup = ceil(1.5 * tuplesPerPrevGroup / nGroups);
>
> but then in the next loop we call estimate_num_groups_incremental and
> pass this "tweaked" tuplesPerPrevGroup value to it. I'm pretty sure this
> may have various strange consequences - we'll calculate the nGroups
> based on the inflated value, and we'll calculate tuplesPerPrevGroup from
> that again - that seems susceptible to "amplification".
>
> We inflate tuplesPerPrevGroup by 50%, which means we'll get a higher
> nGroups estimate in the next loop - but not linearly. An then we'll
> calculate the inflated tuplesPerPrevGroup and estimated nGroup ...

Weighting coefficient '1.5' shows our desire to minimize the number of
comparison operations on each next attribute of a pathkeys list.
Increasing this coef we increase a chance, that planner will order
pathkeys by decreasing of uniqueness.
I think, it's ok.

--
regards,
Andrey Lepikhov
Postgres Professional

Attachment Content-Type Size
fixes.txt text/plain 7.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2022-01-20 05:32:43 Re: Refactoring of compression options in pg_basebackup
Previous Message Michael Paquier 2022-01-20 04:38:45 Re: pg_upgrade should truncate/remove its logs before running