Re: Gather Merge

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather Merge
Date: 2016-10-27 09:50:36
Message-ID: CAGPqQf3TsdOCTSHUBqR+K4jRYuU62Y+NepwcrCm0oe=n1kP1XA@mail.gmail.com
Lists: pgsql-hackers

Please find attached the latest patch, which fixes the review points and
includes some additional clean-up:

- Got rid of funnel_slot, as it's not needed for Gather Merge.
- Renamed fill_tuple_array to form_tuple_array.
- Fixed a possible infinite loop in gather_merge_init() (reported by Amit
Kapila).
- Fixed the tqueue.c memory leak by calling TupleQueueReaderNext() in a
per-tuple context (a minimal sketch of the pattern follows this list).
- Code cleanup in ExecGatherMerge().
- Added a new function gather_merge_clear_slots(), which clears out all
Gather Merge slots and also frees the tuple arrays at the end of execution.
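
For reference, here is a minimal sketch of the per-tuple-context pattern
mentioned above (variable names are illustrative, not the patch's;
TupleQueueReaderNext() and heap_copytuple() are the existing server APIs):

/* Read from the queue inside the short-lived per-tuple context. */
oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
tup = TupleQueueReaderNext(reader, nowait, &done);
MemoryContextSwitchTo(oldcontext);

/*
 * Copy the tuple out of the per-tuple context before stashing it in the
 * tuple array, so that anything tqueue.c allocated is reclaimed when the
 * context is reset.
 */
if (HeapTupleIsValid(tup))
    tup = heap_copytuple(tup);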

I did the performance testing again with the latest patch and haven't
observed any regression. Some of the TPC-H queries show an additional
benefit with the latest patch, but it's just under 5%.

Do let me know if I missed anything.

On Mon, Oct 24, 2016 at 11:55 AM, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
wrote:

>
>
> On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>
>> On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
>> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
>> > On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> There is a lot of common code between ExecGatherMerge and ExecGather.
>> >> Do you think it makes sense to have a common function to avoid the
>> >> duplication?
>> >>
>> >> I see there are small discrepancies between the two; for example, I
>> >> don't see the use of the single_copy flag, which is present in the
>> >> Gather node.
>> >>
>> >
>> > Yes, I also thought about centralizing some things between ExecGather
>> > and ExecGatherMerge, but that code isn't really settled yet, and I
>> > thought it might change, particularly for Gather Merge. And as
>> > explained by Robert, single_copy doesn't make sense for Gather Merge.
>> > I will still look into this to see if something can be centralized.
>> >
>>
>> Okay, I haven't thought about it, but do let me know if you couldn't
>> find any way to merge the code.
>>
>>
> Sure, I will look into this.
>
>
>> >>
>> >> 3.
>> >> +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool
>> >> force)
>> >> {
>> >> ..
>> >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
>> >> +
>> >> + /*
>> >> + * try to read more tuple into nowait mode and store it into the tuple
>> >> + * array.
>> >> + */
>> >> + if (HeapTupleIsValid(tup))
>> >> + fill_tuple_array(gm_state, reader);
>> >>
>> >> How is the above-read tuple stored in the array? In any case, the
>> >> above interface seems slightly awkward to me. Basically, I think what
>> >> you are trying to do here is: after reading the first tuple in wait
>> >> mode, you are trying to fill the array by reading more tuples. So,
>> >> can't we push the reading of this first tuple into that function and
>> >> name it form_tuple_array()?
>> >>
>> >
>> > Yes, you are right.
>> >
>>
>> You have not answered my first question. I will try to ask again: how
>> is the tuple read by the code below stored in the array:
>>
>>
> The tuple gets stored directly into the related TupleTableSlot: at the
> end of gather_merge_readnext(), it builds the TupleTableSlot for the
> given tuple. So the tuple read by the above code is stored directly
> into a TupleTableSlot.
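>
> To illustrate, a minimal sketch of that last step (the ExecStoreTuple()
> call shows the pattern; the exact arguments are illustrative, not
> verbatim from the patch):
>
> tup = gm_readnext_tuple(gm_state, reader, force, NULL);
> ...
> /* Build this reader's slot directly from the fetched tuple. */
> ExecStoreTuple(tup,                        /* tuple to store */
>                gm_state->gm_slots[reader], /* this reader's slot */
>                InvalidBuffer,              /* tuple is not in a buffer */
>                true);                      /* pfree tuple when slot is cleared */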
>
>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
>>
>> > First it tries to read a tuple in wait mode, and once it finds a
>> > tuple, it tries to fill the tuple array (which basically reads tuples
>> > in nowait mode). The reason I keep them separate is that when
>> > initializing the gather merge, if we are unable to read a tuple from
>> > all the workers, then while trying to re-read we again try to fill the
>> > tuple array for any worker that has already produced at least a single
>> > tuple (see gather_merge_init() for more details).
>> >
>>
>> Whenever any worker produces a tuple, you already try to fill the
>> array in gather_merge_readnext(), so does the above code help much?
>>
>> > Also, I thought filling the tuple array (which basically reads tuples
>> > in nowait mode) and reading a tuple (in wait mode) are two separate
>> > tasks, and if they are in separate functions the code looks clearer.
>>
>> To me that looks slightly confusing.
>>
>> > If you have any suggestion for the function name (fill_tuple_array),
>> > I am open to changing it.
>> >
>>
>> form_tuple_array (form_tuple is used in many places in the code, so it
>> should look okay).
>
>
> OK, I will rename it in the next patch.
>
>
>> I think you might want to consider forming the array even for the
>> leader; although it might not be as beneficial as for the workers,
>> OTOH, the code will get simplified if we do it that way.
>>
>
> Yes, I did that earlier, and as you guessed it isn't really beneficial;
> so, to avoid the extra memory allocation for the tuple array, I am not
> forming the array for the leader.
>
>> Today, I observed another issue in the code:
>>
>> +gather_merge_init(GatherMergeState *gm_state)
>> {
>> ..
>> +reread:
>> + for (i = 0; i < nreaders + 1; i++)
>> + {
>> + if (TupIsNull(gm_state->gm_slots[i]) ||
>> + gm_state->gm_slots[i]->tts_isempty)
>> + {
>> + if (gather_merge_readnext(gm_state, i, initialize ? false : true))
>> + {
>> + binaryheap_add_unordered(gm_state->gm_heap,
>> + Int32GetDatum(i));
>> + }
>> + }
>> + else
>> + fill_tuple_array(gm_state, i);
>> + }
>> + initialize = false;
>> +
>> + for (i = 0; i < nreaders; i++)
>> + if (TupIsNull(gm_state->gm_slots[i]) || gm_state->gm_slots[i]->tts_isempty)
>> + goto reread;
>> ..
>> }
>>
>> This code can cause an infinite loop. Consider a case where one of the
>> workers doesn't get any tuple because, by the time it starts, all the
>> tuples have been consumed by the other workers. The above code will
>> keep looping to fetch a tuple from that worker, whereas that worker
>> can't return any tuple. I think you can fix it by checking whether the
>> particular queue has been exhausted.
>>
>>
> Oh yes. I will work on the fix and soon submit the next version of the
> patch.
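>
> To make the intent concrete, here is a minimal sketch of the kind of
> check Amit suggests, assuming a hypothetical per-reader flag
> (gm_tuple_done is not from the actual patch):
>
> for (i = 0; i < nreaders + 1; i++)
> {
>     /* Skip readers whose queue is already known to be exhausted. */
>     if (gm_state->gm_tuple_done[i])
>         continue;
>
>     if (TupIsNull(gm_state->gm_slots[i]))
>     {
>         if (gather_merge_readnext(gm_state, i, initialize ? false : true))
>             binaryheap_add_unordered(gm_state->gm_heap, Int32GetDatum(i));
>     }
> }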
>
>
>> >> >
>> >> > Open Issue:
>> >> >
>> >> > - Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak in
>> >> > tqueue.c by calling gather_readnext() in a per-tuple context. For
>> >> > Gather Merge that is not possible, as we store the tuple into the
>> >> > tuple array and we want the tuple to be freed only once it has
>> >> > passed through the merge-sort algorithm. One idea is, we can also
>> >> > call gm_readnext_tuple() under the per-tuple context (which will
>> >> > fix the leak in tqueue.c) and then store a copy of the tuple into
>> >> > the tuple array.
>> >> >
>> >>
>> >> Won't an extra copy per tuple impact performance? Was the fix in the
>> >> mentioned commit for record or composite types? If so, does
>> >> GatherMerge support such types, and if it does, does it provide any
>> >> benefit over Gather?
>> >>
>> >
>> > I don't think it was specifically for record or composite types - but
>> > I might be wrong. As per my understanding, the commit fixed a leak in
>> > tqueue.c.
>> >
>>
>> Hmm, in tqueue.c, I think the memory leak was in the remapping logic;
>> refer to Tom's mail [1] ("Just to add insult to injury, the backend's
>> memory consumption bloats to something over 5.5G during that last
>> query").
>>
>> > The fix was to establish the standard of calling
>> > TupleQueueReaderNext() with a shorter memory context, so that tqueue.c
>> > doesn't leak the memory.
>> >
>> > My idea to fix this is to call TupleQueueReaderNext() in a per-tuple
>> > context, then copy the tuple and store it in the tuple array; later,
>> > the next run of ExecStoreTuple() will free the earlier tuple. I will
>> > work on that.
>> >
>>
>> Okay, if you think that is viable, then you can do it, but do check the
>> performance impact of the same, because an extra copy per fetched tuple
>> can impact performance.
>>
>>
> Sure, I will check the performance impact of the same.
>
>
>
>>
>> [1] - https://www.postgresql.org/message-id/32763.1469821037%40sss.pgh.pa.us
>>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com
>>
>
>
>
> Thanks,
> Rushabh Lathia
> www.EnterpriseDB.com
>

--
Rushabh Lathia

Attachment: gather_merge_v3.patch (application/x-download, 50.1 KB)
