Re: Parallel Aggregates for string_agg and array_agg

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2022-08-03 01:42:47
Message-ID: CAApHDvq71ssfOK1=dUsLPmf_cYntyC4PHR4C72+=UQmRsC3tTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 3 Aug 2022 at 12:18, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Now as against that, if the underlying relation scan is parallelized
> then you already have unpredictable input ordering and thus unpredictable
> aggregate results. So anyone who cares about that has already had
> to either disable parallelism or insert an ORDER BY somewhere, and
> either of those fixes will still work if the aggregates support
> parallelism.

I think users relying on the transition order without an ORDER BY
clause in the aggregate function likely have become accustomed to
having to disable various GUCs before they run their query.

For example, they might have something like:

BEGIN;
SET LOCAL enable_hashagg = 0;
SET LOCAL enable_hashjoin = 0;
SET LOCAL enable_nestloop = 0;

SELECT col, string_agg(col2, ','), array_agg(col3) FROM tab1 INNER
JOIN tab2 ON ... ORDER BY ...

COMMIT;

and are relying on a Merge Join being chosen so that the final join
rel is ordered according to the ORDER BY clause which allows GroupAgg
to skip the Sort phase thus providing a pre-sorted path for
aggregation.

Here we're adding to the list of GUCs they'd need to disable as
they'll maybe want to add SET max_parallel_workers_per_gather = 0; now
too. I bet all those people knew they were on thin ice already.

I stand by the fact we can't let undefined behaviour stand in the way
of future optimisations. Not having these disables more than just
parallel query. There's other reasons to want combine functions, for
example partial aggregation done during partition-wise aggregation is
disabled for aggregates without a combine function too. I'm pretty
sure we must be nearing wanting a give-me-partial-aggregate-results
syntax so that we can have FDWs feed us with partial results so we can
harness the resources of an array of foreign servers to farm off the
aggregation work for us. We already have async Append. It feels (to
me) like we're on the cusp of needing that syntax.

> Hence, I'm not going to fight hard if you really want
> to do it. But I remain unconvinced that the cost/benefit tradeoff
> is attractive.

Perhaps we're early enough in the release cycle to give enough time
for people to realise and complain if they're in that 80% group,
according to your estimates.

David

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dmitry Koterov 2022-08-03 01:57:41 Does having pg_last_wal_replay_lsn[replica] >= pg_current_wal_insert_lsn[master] guarantee that the replica is caught up?
Previous Message Masahiko Sawada 2022-08-03 01:35:14 Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns