Re: Gather Merge

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather Merge
Date: 2016-10-20 07:42:11
Message-ID: CAA4eK1+Bh8xDUJ9ypBN_OEvC2pV37byQcgWqHhi-V5kFjYQaVw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
<rushabh(dot)lathia(at)gmail(dot)com> wrote:
> On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
>>
>> There is lot of common code between ExecGatherMerge and ExecGather.
>> Do you think it makes sense to have a common function to avoid the
>> duplicity?
>>
>> I see there are small discrepancies in both the codes like I don't see
>> the use of single_copy flag, as it is present in gather node.
>>
>
> Yes, even I thought to centralize some things of ExecGather and
> ExecGatherMerge, but it's really not something that is fixed, and I
> thought it might change, particularly for Gather Merge. And as
> explained by Robert, single_copy doesn't make sense for Gather Merge.
> I will still look into whether something can be centralized.
>

Okay, I haven't thought about it, but do let me know if you couldn't
find any way to merge the code.

>>
>> 3.
>> +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)
>> {
>> ..
>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
>> +
>> + /*
>> +  * try to read more tuples in nowait mode and store them into the
>> +  * tuple array.
>> +  */
>> + if (HeapTupleIsValid(tup))
>> + fill_tuple_array(gm_state, reader);
>>
>> How is the above read tuple stored in the array? In any case, the
>> above interface seems slightly awkward to me. Basically, I think what
>> you are trying to do here is: after reading the first tuple in wait
>> mode, you try to fill the array by reading more tuples. So, can't we
>> push the reading of this first tuple into that function and name it
>> form_tuple_array()?
>>
>
> Yes, you are right.
>

You have not answered my first question. I will try to ask again: how
is the tuple read by the below code stored in the array?

>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);

> First it's trying to read a tuple in wait mode, and once it finds a
> tuple it tries to fill the tuple array (which basically tries to read
> tuples in nowait mode). The reason I keep it separate is that, in the
> case of initializing the gather merge, if we are unable to read a
> tuple from all the workers, then while re-reading we again try to
> fill the tuple array for each worker that has already produced at
> least a single tuple (see gather_merge_init() for more details).
>

Whenever any worker produces one tuple, you already try to fill the
array in gather_merge_readnext(), so does the above code help much?

> Also I thought filling the tuple array (which basically reads tuples
> in nowait mode) and reading a tuple (in wait mode) are two separate
> tasks, and if they are in separate functions the code looks clearer.

To me that looks slightly confusing.

> If you have any suggestion for the function name (fill_tuple_array),
> then I am open to changing that.
>

form_tuple_array (form_tuple is used in many places in the code, so it
should look okay). I think you might want to consider forming the
array even for the leader; although it might not be as beneficial as
for workers, OTOH the code will get simplified if we do it that way.

Today, I observed another issue in code:

+gather_merge_init(GatherMergeState *gm_state)
{
..
+reread:
+	for (i = 0; i < nreaders + 1; i++)
+	{
+		if (TupIsNull(gm_state->gm_slots[i]) ||
+			gm_state->gm_slots[i]->tts_isempty)
+		{
+			if (gather_merge_readnext(gm_state, i, initialize ? false : true))
+			{
+				binaryheap_add_unordered(gm_state->gm_heap,
+										 Int32GetDatum(i));
+			}
+		}
+		else
+			fill_tuple_array(gm_state, i);
+	}
+	initialize = false;
+
+	for (i = 0; i < nreaders; i++)
+		if (TupIsNull(gm_state->gm_slots[i]) ||
+			gm_state->gm_slots[i]->tts_isempty)
+			goto reread;
..
}

This code can cause an infinite loop. Consider a case where one of the
workers doesn't get any tuple because, by the time it starts, all the
tuples have been consumed by the other workers. The above code will
keep looping to fetch a tuple from that worker, whereas that worker
can't return any tuple. I think you can fix it by checking whether the
particular queue has been exhausted.

>> >
>> > Open Issue:
>> >
>> > - Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak in
>> > tqueue.c by calling gather_readnext() in a per-tuple context. Now
>> > for gather merge that is not possible, as we store the tuple into
>> > the tuple array and we want the tuple to be freed only once it has
>> > passed through the merge sort algorithm. One idea is, we can also
>> > call gm_readnext_tuple() under the per-tuple context (which will
>> > fix the leak in tqueue.c) and then store a copy of the tuple into
>> > the tuple array.
>> >
>>
>> Won't an extra copy per tuple impact performance? Was the fix in the
>> mentioned commit for record or composite types? If so, does
>> GatherMerge support such types, and if it does, does it provide any
>> benefit over Gather?
>>
>
> I don't think it was specifically for record or composite types, but
> I might be wrong. As per my understanding, the commit fixes a leak in
> tqueue.c.
>

Hmm, in tqueue.c, I think the memory leak was in the remapping logic;
refer to the mail [1] from Tom ("Just to add insult to injury, the
backend's memory consumption bloats to something over 5.5G during
that last query").

> The fix was to add a standard to call TupleQueueReaderNext() with a
> shorter memory context, so that tqueue.c doesn't leak the memory.
>
> My idea to fix this is to call TupleQueueReaderNext() with the
> per-tuple context, then copy the tuple and store it into the tuple
> array; later, the next run of ExecStoreTuple() will free the earlier
> tuple. I will work on that.
>

Okay, if you think that is viable, then you can do it, but do check
the performance impact of the same, because an extra copy per fetched
tuple can impact performance.

[1] - https://www.postgresql.org/message-id/32763.1469821037%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
