|From:||Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>|
|To:||David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>|
|Cc:||Stephen Frost <sfrost(at)snowman(dot)net>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: Parallel Aggregates for string_agg and array_agg|
|Views:||Raw Message | Whole Thread | Download mbox | Resend email|
David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> writes:
> [ combinefn_for_string_and_array_aggs_v7.patch ]
I spent a fair amount of time hacking on this with intent to commit,
but just as I was getting to code that I liked, I started to have second
thoughts about whether this is a good idea at all. I quote from the fine
The aggregate functions array_agg, json_agg, jsonb_agg,
json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
similar user-defined aggregate functions, produce meaningfully
different result values depending on the order of the input
values. This ordering is unspecified by default, but can be controlled
by writing an ORDER BY clause within the aggregate call, as shown in
Section 4.2.7. Alternatively, supplying the input values from a sorted
subquery will usually work ...
I do not think it is accidental that these aggregates are exactly the ones
that do not have parallelism support today. Rather, that's because you
just about always have an interest in the order in which the inputs get
aggregated, which is something that parallel aggregation cannot support.
I fear that what will happen, if we commit this, is that something like
0.01% of the users of array_agg and string_agg will be pleased, another
maybe 20% will be unaffected because they wrote ORDER BY which prevents
parallel aggregation, and the remaining 80% will scream because we broke
their queries. Telling them they should've written ORDER BY isn't going
to cut it, IMO, when the benefit of that breakage will accrue only to some
very tiny fraction of use-cases.
In short, I think we ought to reject this.
Just in case I'm outvoted, attached is what I'd gotten done so far.
The main noncosmetic changes I'd made were to improve the caching logic
(it's silly to set up a lookup cache and then not cache the fmgr_info
lookup) and to not cheat quite as much on the StringInfo passed down to
the element typreceive function. There isn't any other place, I don't
think, where we don't honor the expectation that StringInfos have trailing
null bytes, and some places may depend on it --- array_recv does.
The main thing that remains undone is to get some test coverage ---
AFAICS, none of these new functions get exercised in the standard
I'm also a bit unhappy that the patch introduces code into
array_userfuncs.c that knows everything there is to know about the
contents of ArrayBuildState and ArrayBuildStateArr. Previously that
knowledge was pretty well localized in arrayfuncs.c. I wonder if it'd
be better to move the new combinefuncs and serial/deserial funcs into
arrayfuncs.c. Or perhaps export primitives from there that could be
used to build them.
regards, tom lane
|Next Message||Vladimir Sitnikov||2018-03-26 20:30:47||Re: Proposal: http2 wire format|
|Previous Message||Rady, Doug||2018-03-26 20:23:48||Re: PATCH: pgbench - option to build using ppoll() for larger connection counts|