Re: Parallel Aggregates for string_agg and array_agg

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Aggregates for string_agg and array_agg
Date: 2018-03-11 11:10:00
Message-ID: 1b6ee15c-c218-761e-0120-c1ed438e2db2@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/11/2018 07:31 AM, David Rowley wrote:
> On 11 March 2018 at 12:11, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 03/05/2018 04:51 AM, David Rowley wrote:
>>> On 5 March 2018 at 04:54, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> Consider the following slightly backward-looking case;
>>>
>>> select string_agg(',', x::text) from generate_Series(1,10)x;
>>> string_agg
>>> ----------------------
>>> ,2,3,4,5,6,7,8,9,10,
>>>
>>> Here the delimiter is the number, not the ','. Of course, the
>>> delimiter for the first aggregated result is not shown. The previous
>>> implementation of the transfn for this aggregate just threw away the
>>> first delimiter, but that's no good for parallel aggregates as the
>>> transfn may be getting called in a parallel worker, in which case
>>> we'll need that delimiter in the combine function to properly join to
>>> the other partial aggregated results matching the same group.
>>>
>>
>> Hmmm, you're right I haven't considered the delimiter might be variable.
>> But isn't just yet another hint that while StringInfo was suitable for
>> aggregate state of non-parallel string_agg, it may not be for the
>> parallel version?
>>
>> What if the parallel string_agg instead used something like this:
>>
>> struct StringAggState
>> {
>> char *delimiter; /* first delimiter */
>> StringInfoData str; /* partial aggregate state */
>> } StringAggState;
>>
>> I think that would make the code cleaner, compared to using the cursor
>> field for that purpose. But maybe it'd make it not possible to share
>> code with the non-parallel aggregate, not sure.
>
> It would be possible to use something like that. The final function
> would just need to take 'str' and ignore 'delimiter', whereas the
> combine function would need both. However, you could optimize the
> above to just have a single StringInfoData and have a pointer to the
> start of the actual data (beyond the first delimiter), that would save
> us a call to palloc and also allow the state's data to exist in one
> contiguous block of memory, which will be more cache friendly when it
> comes to reading it back again. The pointer here would, of course,
> have to be an offset into the data array, since repallocs would cause
> problems as the state grew.
>
> This is kinda the option I was going for with using the cursor. I
> didn't think that was going to be a problem. It seems that cursor was
> invented so that users of StringInfo could store a marker somehow
> along with the string. The comment for cursor read:
>
> * cursor is initialized to zero by makeStringInfo or initStringInfo,
> * but is not otherwise touched by the stringinfo.c routines.
> * Some routines use it to scan through a StringInfo.
>
> The use of the cursor field does not interfere with pqformat.c's use
> of it as in the serialization functions we're creating a new
> StringInfo for the 'result'. If anything tries to send the internal
> state, then that's certainly broken. It needs to be serialized before
> that can happen.
>
> I don't quite see how inventing a new struct would make the code
> cleaner. It would make the serialization and deserialization functions
> have to write and read more fields for the lengths of the data
> contained in the state.
>
> I understand that the cursor field is used in pqformat.c quite a bit,
> but I don't quite understand why it should get to use that field the
> way it wants, but the serialization and deserialization functions
> shouldn't. I'd understand that if we were trying to phase out the
> cursor field from StringInfoData, but I can't imagine that's going to
> happen.
>
> Of course, with that all said. If you really strongly object to how
> I've done this, then I'll change it to use a new struct type. I had
> just tried to make the patch footprint as small as possible, and the
> above text is me just explaining my reasoning behind this, not me
> objecting to your request for me to change it. Let me know your
> thoughts after reading the above.
>

Hmmm, I guess you're right, I didn't really thought that through. I
don't object to sticking to the current approach.

> In the meantime, I've attached an updated patch. The only change it
> contains is updating the "Partial Mode" column in the docs from "No"
> to "Yes".
>

Yeah, seems fine to me. I wonder what else would be needed before
switching the patch to RFC. I plan to do that after a bit more testing
sometime early next week, unless someone objects.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2018-03-11 11:19:34 Re: [HACKERS] GUC for cleanup indexes threshold.
Previous Message David Rowley 2018-03-11 10:22:54 Parallel index creation does not properly cleanup after error