Re: Parallel Aggregates for string_agg and array_agg

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2018-03-26 21:20:06
Message-ID: 20180326212006.GT24540@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
> On 03/26/2018 10:27 PM, Tom Lane wrote:
> > I fear that what will happen, if we commit this, is that something like
> > 0.01% of the users of array_agg and string_agg will be pleased, another
> > maybe 20% will be unaffected because they wrote ORDER BY which prevents
> > parallel aggregation, and the remaining 80% will scream because we broke
> > their queries. Telling them they should've written ORDER BY isn't going
> > to cut it, IMO, when the benefit of that breakage will accrue only to some
> > very tiny fraction of use-cases.
>
> Isn't the ordering unreliable *already*? It depends on ordering of
> tuples on the input. So if the table is scanned by index scan or
> sequential scan, that will affect the array_agg/string_agg results. If
> the input is a join, it's even more volatile.

It's not even just that: for a seq scan, it'll also depend on whether
anyone else is scanning the table at the same time, as we might not
start at the beginning of the table thanks to synchronized seq scans.
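
To illustrate the synchronized-scan point, here is a minimal Python sketch. It is a toy model, not PostgreSQL code: it assumes a four-page table that is scanned circularly from an arbitrary start page, which is roughly what happens when a new scan joins one already in progress. The names `PAGES`, `scan_from` and `string_agg` are hypothetical helpers invented for the example.

```python
# Toy model: a heap of pages, each holding a few rows in physical order.
PAGES = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]


def scan_from(start_page):
    """Yield rows page by page, starting at start_page and wrapping around."""
    n = len(PAGES)
    for i in range(n):
        yield from PAGES[(start_page + i) % n]


def string_agg(rows, delimiter=","):
    """Concatenate rows in arrival order, as an aggregate without ORDER BY would."""
    return delimiter.join(rows)


# A scan that starts at the beginning of the table.
print(string_agg(scan_from(0)))  # a,b,c,d,e,f,g,h
# A scan that joins another scan already positioned at page 2.
print(string_agg(scan_from(2)))  # e,f,g,h,a,b,c,d
```

Both calls see the same set of rows, but the aggregated string differs because only the starting page changed.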

> IMHO it's not like we're making the ordering unpredictable - it's been
> like that since forever.

Yeah, there certainly seems to be a lot of opportunity for the ordering
to end up being volatile already, and queries really shouldn't assume
that it won't change. I do think it was probably a bad move on our part
to say in the documentation that ordering a subquery will "usually"
work. Still, treating that slip-up as a promise that feeding an
aggregate from an ordered subquery will *always* work is a pretty big
stretch (and is it even really true today anyway? The documentation
certainly doesn't seem to be clear on that...).

> Also, how is this different from ORDER BY clause? If a user does not
> specify an ORDER BY clause, I don't think we'd care very much about
> changes to output ordering due to plan changes, for example.

This seems like it boils down to "well, everyone *knows* that rows can
come back in any order" combined with "our docs claim that an ordered
subquery will *usually* work".

In the end, I do tend to agree that we probably should add parallel
support to these aggregates, but it'd also be nice to hear from those
who worked on adding parallelism to the various aggregates as to
whether there was some reason these were skipped.
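
For readers following the ordering concern, here is a rough Python sketch of why a parallel string_agg without ORDER BY can return different results from run to run. It is only a model of the partial-aggregate-then-combine idea under the assumption that the leader may receive worker states in any order; the functions `partial_string_agg`, `combine` and `finalize` are hypothetical and do not correspond to actual PostgreSQL function names.

```python
from itertools import permutations


def partial_string_agg(rows):
    """Per-worker step: collect this worker's rows in arrival order."""
    return list(rows)


def combine(state_a, state_b):
    """Combine step: append one partial state to another."""
    return state_a + state_b


def finalize(state, delimiter=","):
    """Final step: turn the combined state into the result string."""
    return delimiter.join(state)


# Hypothetical split of one table across three workers.
worker_chunks = [["a", "b"], ["c", "d"], ["e", "f"]]
partials = [partial_string_agg(chunk) for chunk in worker_chunks]

# Try every order in which the leader could receive the partial states.
results = set()
for order in permutations(partials):
    state = []
    for partial in order:
        state = combine(state, partial)
    results.add(finalize(state))

print(len(results))        # 6
print(sorted(results)[0])  # a,b,c,d,e,f
```

Every result contains the same six values; only their order differs, which is the same kind of variation that a different scan type or join order can already produce in a serial plan.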

Thanks!

Stephen
