Re: Parallel Aggregates for string_agg and array_agg

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2018-03-26 20:27:47
Message-ID: 6538.1522096067@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
> [ combinefn_for_string_and_array_aggs_v7.patch ]

I spent a fair amount of time hacking on this with intent to commit,
but just as I was getting to code that I liked, I started to have second
thoughts about whether this is a good idea at all. I quote from the fine
manual:

The aggregate functions array_agg, json_agg, jsonb_agg,
json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
similar user-defined aggregate functions, produce meaningfully
different result values depending on the order of the input
values. This ordering is unspecified by default, but can be controlled
by writing an ORDER BY clause within the aggregate call, as shown in
Section 4.2.7. Alternatively, supplying the input values from a sorted
subquery will usually work ...

I do not think it is accidental that these aggregates are exactly the ones
that do not have parallelism support today. Rather, that's because you
just about always have an interest in the order in which the inputs get
aggregated, which is something that parallel aggregation cannot support.

I fear that what will happen, if we commit this, is that something like
0.01% of the users of array_agg and string_agg will be pleased, another
maybe 20% will be unaffected because they wrote ORDER BY which prevents
parallel aggregation, and the remaining 80% will scream because we broke
their queries. Telling them they should've written ORDER BY isn't going
to cut it, IMO, when the benefit of that breakage will accrue only to some
very tiny fraction of use-cases.

In short, I think we ought to reject this.

Just in case I'm outvoted, attached is what I'd gotten done so far.
The main noncosmetic changes I'd made were to improve the caching logic
(it's silly to set up a lookup cache and then not cache the fmgr_info
lookup) and to not cheat quite as much on the StringInfo passed down to
the element typreceive function. There isn't any other place, I don't
think, where we don't honor the expectation that StringInfos have trailing
null bytes, and some places may depend on it --- array_recv does.

The main thing that remains undone is to get some test coverage ---
AFAICS, none of these new functions get exercised in the standard
regression tests.

I'm also a bit unhappy that the patch introduces code into
array_userfuncs.c that knows everything there is to know about the
contents of ArrayBuildState and ArrayBuildStateArr. Previously that
knowledge was pretty well localized in arrayfuncs.c. I wonder if it'd
be better to move the new combinefuncs and serial/deserial funcs into
arrayfuncs.c. Or perhaps export primitives from there that could be
used to build them.

regards, tom lane

Attachment Content-Type Size
combinefn_for_string_and_array_aggs_v8.patch text/x-diff 37.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Vladimir Sitnikov 2018-03-26 20:30:47 Re: Proposal: http2 wire format
Previous Message Rady, Doug 2018-03-26 20:23:48 Re: PATCH: pgbench - option to build using ppoll() for larger connection counts