Re: Parallel Aggregates for string_agg and array_agg

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2018-03-26 22:14:27
Message-ID: 13a2ab54-0d83-d2a2-faa2-963310d8f75f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/26/2018 11:19 PM, Tom Lane wrote:
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> writes:
>> On 03/26/2018 10:27 PM, Tom Lane wrote:
>>> I fear that what will happen, if we commit this, is that something like
>>> 0.01% of the users of array_agg and string_agg will be pleased, another
>>> maybe 20% will be unaffected because they wrote ORDER BY which prevents
>>> parallel aggregation, and the remaining 80% will scream because we broke
>>> their queries. Telling them they should've written ORDER BY isn't going
>>> to cut it, IMO, when the benefit of that breakage will accrue only to some
>>> very tiny fraction of use-cases.
>
>> Isn't the ordering unreliable *already*?
>
> Not if the query is such that what gets chosen is, say, an indexscan
> or mergejoin. It might be theoretically unreliable and yet work fine
> for a given application.
>

What about parallel versions of those plans, though? That is, what
prevents the planner from generating plans like this

Hash Aggregate
-> Gather
-> Partial Index Scan

AFAIK that's a valid plan, and it does not have the stable ordering
property of a non-parallel index scan. Similarly for Merge Join.

> I might be too pessimistic about the fraction of users who are depending
> on ordered input without having written anything that explicitly forces
> that ... but I stand by the theory that it substantially exceeds the
> fraction of users who could get any benefit.
>

Not sure, it's pretty much impossible to produce a number that would
meaningfully represent the whole user base. All I can say is that the
number of people I know, who would benefit from this is significant.

>
> Your own example of assuming that separate aggregates are computed
> in the same order reinforces my point, I think. In principle, anybody
> who's doing that should write
>
> array_agg(e order by x),
> array_agg(f order by x),
> string_agg(g order by x)
>
> because otherwise they shouldn't assume that; the manual certainly doesn't
> promise it. But nobody does that in production, because if they did
> they'd get killed by the fact that the sorts are all done independently.

Yes, the manual does not promise that. But I find that expectation way
more "natural" than deterministic ordering without ORDER BY clause. I
expect each tuple to be processed "at once" - that it's added to all
arrays in the same step, irrespectedly of the ordering of input tuples.

> (We should improve that someday, but it hasn't been done yet.) So I think
> there are an awful lot of people out there who are assuming more than a
> lawyerly reading of the manual would allow. Their reaction to this will
> be about like ours every time the GCC guys decide that some longstanding
> behavior of C code isn't actually promised by the text of the C standard.
>

But we're not *already* providing reliable ordering without an ORDER BY
clause (both for queries as a whole and array_agg/string_agg). It can
produce arbitrary ordering, depending on what plan you happen to get.

So I don't think we'd be changing any longstanding behavior, because
it's already unreliable. If it was semi-reliable before, assuming the
query only ever chose index scans or merge joins, it likely got broken
by introduction of parallel versions of those plans. If the users
disable parallel queries (to get the old plans), they're going to get
non-parallel semi-reliable plans again.

OTOH we are currently accumulating all the arrays in the same order, so
the results will match. (At least that's what I believe is the current
reliable behavior.) So following the GCC analogy, changing this seems
much more akin to GCC people changing some longstanding behavior.

Also, if the arrays were not accumulated in the same way reliably,
wouldn't this mean all the discussion about reliable ordering with
certain plans is pointless?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2018-03-26 22:28:46 Re: Parallel Aggregates for string_agg and array_agg
Previous Message Tom Lane 2018-03-26 22:12:09 Re: Backend memory dump analysis