Re: Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Regina Obe <lr(at)pcorp(dot)us>, 'Peter Eisentraut' <peter(dot)eisentraut(at)2ndquadrant(dot)com>, 'PostgreSQL-development' <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg
Date: 2017-06-18 23:08:22
Message-ID: 269bca9e-9248-2d22-82be-6e82bbc101b3@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 6/7/17 5:52 AM, Regina Obe wrote:
>> On 6/6/17 13:52, Regina Obe wrote:
>>> It seems CREATE AGGREGATE was expanded in 9.6 to support
>>> parallelization of aggregate functions using transitions, with the
>>> addition of serialfunc and deserialfunc to the aggregate definitions.
>>>
>>> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
>>>
>>> I was looking at the PostgreSQL 10 source code for some example usages
>>> of this and was hoping that array_agg and string_agg would support the feature.
>
>> I'm not sure how you would parallelize these, since in most uses
>> you want to have a deterministic output order.
>
>> --
>> Peter Eisentraut http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> Good point. If that's the reason it wasn't done, that's fine; I just wasn't sure.
>
> But if you didn't have an ORDER BY in your aggregate usage, and you
> did have those transition functions, it shouldn't be any different from
> any other use case right?
> I imagine you are right that most folks who use array_agg and
> string_agg usually combine it with array_agg(... ORDER BY ..)
>

I think what TL had in mind is something like

    SELECT array_agg(x) FROM (
        SELECT x FROM bar ORDER BY y
    ) foo;

i.e. a subquery producing the data in predictable order.
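
To make the ordering concern concrete, here is a toy model in Python (not PostgreSQL code). It assumes the input arrives already sorted, that workers receive contiguous chunks, and that the leader concatenates partial states in whatever order the workers finish. All names (`array_agg`, `parallel_array_agg`, `finish_order`) are made up for illustration, and a real parallel scan hands out blocks differently.

```python
# Toy model only: rows arrive already ordered by y, as from the subquery.
rows = [1, 2, 3, 4, 5, 6]


def array_agg(values):
    """Serial aggregation: append each value in arrival order."""
    state = []
    for v in values:
        state.append(v)
    return state


def parallel_array_agg(values, n_workers, finish_order):
    """Aggregate contiguous chunks separately, then concatenate the
    partial states in the (assumed) order the workers finish."""
    size = -(-len(values) // n_workers)  # ceiling division
    partials = [array_agg(values[i:i + size])
                for i in range(0, len(values), size)]
    result = []
    for w in finish_order:
        result.extend(partials[w])
    return result


serial = array_agg(rows)
# Hypothetical run in which worker 1 reports back before worker 0.
parallel = parallel_array_agg(rows, 2, finish_order=[1, 0])

print(serial)
print(parallel)
```

Both results hold the same elements, but only the serial one keeps the order produced by the subquery.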

>
> My main reason for asking is that most of the PostGIS geometry and
> raster aggregate functions use transitions and were patterned after
> array agg.
>

> In the case of PostGIS the sorting is done internally and really
> only to speed things up by taking advantage of things like cascaded
> union algorithms.
> That is always done though (so even if each worker does it on just its
> batch, that's still better than having only one worker).
> So I think it's still very beneficial to break into separate jobs
> since in the end the gather will have, say, 2 biggish geometries or 2
> biggish rasters to union if you have 2 workers, which is still better
> than having a million smallish geometries/rasters to union

I'm not sure I got your point correctly, but if you can (for example)
sort the per-worker results as part of the "serialize" function, and
benefit from that while combining that in the gather, then sure, that
should be a huge win.
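
A rough sketch of that idea, again as a toy model in Python rather than anything resembling the actual C-level aggregate API: each worker sorts its partial state when serializing it, and the combine step merges two already-sorted states instead of re-sorting everything. The function names mirror the CREATE AGGREGATE roles, but the bodies and the sample data are assumptions for illustration.

```python
import heapq
from functools import reduce


def transition(state, value):
    """Per-row step inside one worker: just collect the value."""
    state.append(value)
    return state


def serialize(state):
    """Assumed serialize step: sort the worker's partial state once,
    before it is shipped to the leader."""
    return sorted(state)


def combine(left, right):
    """Combine two partial states that are already sorted.
    A linear merge is enough; no full re-sort is needed."""
    return list(heapq.merge(left, right))


# Hypothetical split of the input rows across three workers.
worker_inputs = [[7, 1, 4], [9, 2], [5, 3, 8, 6]]

partials = []
for rows in worker_inputs:
    state = []
    for v in rows:
        state = transition(state, v)
    partials.append(serialize(state))

# The leader folds the partial states together pairwise.
merged = reduce(combine, partials, [])
print(merged)
```

The point of the sketch is only that the sorting cost is spread over the workers, while the leader does cheap merges.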

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
