Re: Parallel Aggregates for string_agg and array_agg

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2018-03-11 06:31:49
Message-ID: CAKJS1f8LV7AT=AAhdYGKtGrGkSkEgO6C_SW2Ztz1sR3encisqw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11 March 2018 at 12:11, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 03/05/2018 04:51 AM, David Rowley wrote:
>> On 5 March 2018 at 04:54, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> Consider the following slightly backward-looking case;
>>
>> select string_agg(',', x::text) from generate_Series(1,10)x;
>> string_agg
>> ----------------------
>> ,2,3,4,5,6,7,8,9,10,
>>
>> Here the delimiter is the number, not the ','. Of course, the
>> delimiter for the first aggregated result is not shown. The previous
>> implementation of the transfn for this aggregate just threw away the
>> first delimiter, but that's no good for parallel aggregates as the
>> transfn may be getting called in a parallel worker, in which case
>> we'll need that delimiter in the combine function to properly join to
>> the other partial aggregated results matching the same group.
>>
>
> Hmmm, you're right I haven't considered the delimiter might be variable.
> But isn't just yet another hint that while StringInfo was suitable for
> aggregate state of non-parallel string_agg, it may not be for the
> parallel version?
>
> What if the parallel string_agg instead used something like this:
>
> struct StringAggState
> {
> char *delimiter; /* first delimiter */
> StringInfoData str; /* partial aggregate state */
> } StringAggState;
>
> I think that would make the code cleaner, compared to using the cursor
> field for that purpose. But maybe it'd make it not possible to share
> code with the non-parallel aggregate, not sure.

It would be possible to use something like that. The final function
would just need to take 'str' and ignore 'delimiter', whereas the
combine function would need both. However, you could optimize the
above to just have a single StringInfoData and have a pointer to the
start of the actual data (beyond the first delimiter), that would save
us a call to palloc and also allow the state's data to exist in one
contiguous block of memory, which will be more cache friendly when it
comes to reading it back again. The pointer here would, of course,
have to be an offset into the data array, since repallocs would cause
problems as the state grew.

This is kinda the option I was going for with using the cursor. I
didn't think that was going to be a problem. It seems that cursor was
invented so that users of StringInfo could store a marker somehow
along with the string. The comment for cursor read:

* cursor is initialized to zero by makeStringInfo or initStringInfo,
* but is not otherwise touched by the stringinfo.c routines.
* Some routines use it to scan through a StringInfo.

The use of the cursor field does not interfere with pqformat.c's use
of it as in the serialization functions we're creating a new
StringInfo for the 'result'. If anything tries to send the internal
state, then that's certainly broken. It needs to be serialized before
that can happen.

I don't quite see how inventing a new struct would make the code
cleaner. It would make the serialization and deserialization functions
have to write and read more fields for the lengths of the data
contained in the state.

I understand that the cursor field is used in pqformat.c quite a bit,
but I don't quite understand why it should get to use that field the
way it wants, but the serialization and deserialization functions
shouldn't. I'd understand that if we were trying to phase out the
cursor field from StringInfoData, but I can't imagine that's going to
happen.

Of course, with that all said. If you really strongly object to how
I've done this, then I'll change it to use a new struct type. I had
just tried to make the patch footprint as small as possible, and the
above text is me just explaining my reasoning behind this, not me
objecting to your request for me to change it. Let me know your
thoughts after reading the above.

In the meantime, I've attached an updated patch. The only change it
contains is updating the "Partial Mode" column in the docs from "No"
to "Yes".

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
combinefn_for_string_and_array_aggs_v6.patch application/octet-stream 33.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2018-03-11 07:23:39 initdb help message about WAL segment size
Previous Message Peter Geoghegan 2018-03-11 05:48:37 Re: [HACKERS] MERGE SQL Statement for PG11