From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Regina Obe <lr(at)pcorp(dot)us>, 'Peter Eisentraut' <peter(dot)eisentraut(at)2ndquadrant(dot)com>, 'PostgreSQL-development' <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg |
Date: | 2017-06-18 23:08:22 |
Message-ID: | 269bca9e-9248-2d22-82be-6e82bbc101b3@2ndquadrant.com |
Lists: | pgsql-hackers |
Hi,
On 6/7/17 5:52 AM, Regina Obe wrote:
>> On 6/6/17 13:52, Regina Obe wrote:
>>> It seems CREATE AGGREGATE was expanded in 9.6 to support
>>> parallelization of aggregate functions using transitions, with the
>>> addition of serialfunc and deserialfunc to the aggregate definitions.
>>>
>>> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
>>>
>>> I was looking at the PostgreSQL 10 source code for some example usages
>>> of this and was hoping that array_agg and string_agg would support the feature.
>
>> I'm not sure how you would parallelize these, since in most uses
>> you want to have a deterministic output order.
>
>> --
>> Peter Eisentraut http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> Good point. If that's the reason it wasn't done, that's fine; I just wasn't sure.
>
> But if you didn't have an ORDER BY in your aggregate usage, and you
> did have those transition functions, it shouldn't be any different from
> any other use case, right?
> I imagine you are right that most folks who use array_agg and
> string_agg usually combine it with array_agg(... ORDER BY ..)
>
I think what TL had in mind is something like

    SELECT array_agg(x) FROM (
        SELECT x FROM bar ORDER BY y
    ) foo;

i.e. a subquery producing the data in a predictable order.
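
To make the ordering concern concrete, here is a toy model in Python. It is only an illustration under loud assumptions: the round-robin split across workers and the function names are hypothetical, and this is not how the PostgreSQL executor actually hands rows to workers. It just shows that concatenating per-worker partial arrays does not, in general, reproduce the input order.

```python
from typing import List, Sequence


def serial_array_agg(rows: Sequence[int]) -> List[int]:
    """Single-process aggregation: the output keeps the input order."""
    return list(rows)


def parallel_array_agg(rows: Sequence[int], n_workers: int) -> List[int]:
    """Toy parallel aggregation (assumed round-robin row distribution).

    Each hypothetical worker builds its own partial array, and the
    leader simply concatenates the partial arrays.
    """
    partials = [list(rows[w::n_workers]) for w in range(n_workers)]
    combined: List[int] = []
    for partial in partials:
        combined.extend(partial)
    return combined


# Rows as they would come out of the ordered subquery.
ordered_rows = [1, 2, 3, 4, 5, 6]

serial = serial_array_agg(ordered_rows)
parallel = parallel_array_agg(ordered_rows, n_workers=2)

print(serial)    # [1, 2, 3, 4, 5, 6]
print(parallel)  # [1, 3, 5, 2, 4, 6]
```

The same elements come back either way, but only the serial version preserves the order the subquery produced.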
>
> My main reason for asking is that most of the PostGIS geometry and
> raster aggregate functions use transitions and were patterned after
> array_agg.
>
> In the case of PostGIS the sorting is done internally and really
> only to take advantage of things like cascaded union
> algorithms.
> That is always done though (so even if each worker does it on just its
> batch that's still better than having only one worker).
> So I think it's still very beneficial to break into separate jobs
> since in the end the gather will have, say, 2 biggish geometries or 2
> biggish rasters to union if you have 2 workers, which is still better
> than having a million smallish geometries/rasters to union.
I'm not sure I got your point correctly, but if you can (for example)
sort the per-worker results as part of the "serialize" function, and
benefit from that while combining them in the gather, then sure, that
should be a huge win.
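
A rough sketch of that idea in Python, purely as an illustration: the function names are hypothetical and plain integers stand in for whatever the aggregate state really holds. It is not PostgreSQL or PostGIS code. Each worker sorts its own batch, and the leader merges the already-sorted partial results in a single pass instead of re-sorting everything.

```python
import heapq
from typing import Iterable, List


def worker_partial(values: Iterable[int]) -> List[int]:
    """Per-worker step: sort the local batch before passing it on.

    In this sketch it plays the role the "serialize" function would play.
    """
    return sorted(values)


def gather_combine(partials: List[List[int]]) -> List[int]:
    """Leader step: merge the sorted partial results in one pass."""
    return list(heapq.merge(*partials))


# Two hypothetical workers, each with its own unsorted batch.
batches = [[9, 1, 5], [4, 8, 2]]
partials = [worker_partial(batch) for batch in batches]
combined = gather_combine(partials)

print(combined)  # [1, 2, 4, 5, 8, 9]
```

The expensive part (sorting) happens in the workers, and the leader only does a linear merge of a handful of pre-sorted inputs.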
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services