Re: Gather Merge

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather Merge
Date: 2016-10-24 06:25:37
Message-ID: CAGPqQf2cqGb9sAmZMhxAofLCwk1vBgCs5RsBp1EbYpRN_5CKbg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 1:12 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
wrote:

> On Tue, Oct 18, 2016 at 5:29 PM, Rushabh Lathia
> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
> > On Mon, Oct 17, 2016 at 2:26 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> > wrote:
> >>
> >> There is lot of common code between ExecGatherMerge and ExecGather.
> >> Do you think it makes sense to have a common function to avoid the
> >> duplicity?
> >>
> >> I see there are small discrepancies in both the codes like I don't see
> >> the use of single_copy flag, as it is present in gather node.
> >>
> >
> > Yes, even I thought to centrilize some things of ExecGather and
> > ExecGatherMerge,
> > but its really not something that is fixed. And I thought it might change
> > particularly
> > for the Gather Merge. And as explained by Robert single_copy doesn't make
> > sense
> > for the Gather Merge. I will still look into this to see if something
> can be
> > make
> > centralize.
> >
>
> Okay, I haven't thought about it, but do let me know if you couldn't
> find any way to merge the code.
>
>
Sure, I will look into this.

> >>
> >> 3.
> >> +gather_merge_readnext(GatherMergeState * gm_state, int reader, bool
> >> force)
> >> {
> >> ..
> >> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
> >> +
> >> + /*
> >> +
> >> * try to read more tuple into nowait mode and store it into the tuple
> >> + * array.
> >> +
> >> */
> >> + if (HeapTupleIsValid(tup))
> >> + fill_tuple_array(gm_state, reader);
> >>
> >> How the above read tuple is stored in array? In anycase the above
> >> interface seems slightly awkward to me. Basically, I think what you
> >> are trying to do here is after reading first tuple in waitmode, you
> >> are trying to fill the array by reading more tuples. So, can't we
> >> push reading of this fist tuple into that function and name it as
> >> form_tuple_array().
> >>
> >
> > Yes, you are right.
> >
>
> You have not answered my first question. I will try to ask again, how
> the tuple read by below code is stored in the array:
>
>
Tuple directly get stored into related TupleTableSlot.
In gather_merge_readnext()
at the end of function it build the TupleTableSlot for the given tuple. So
tuple
read by above code is directly stored into TupleTableSlot.

>> + tup = gm_readnext_tuple(gm_state, reader, force, NULL);
>
> > First its trying to read tuple into wait-mode, and once
> > it
> > find tuple then its try to fill the tuple array (which basically try to
> read
> > tuple
> > into nowait-mode). Reason I keep it separate is because in case of
> > initializing
> > the gather merge, if we unable to read tuple from all the worker - while
> > trying
> > re-read, we again try to fill the tuple array for the worker who already
> > produced
> > atleast a single tuple (see gather_merge_init() for more details).
> >
>
> Whenever any worker produced one tuple, you already try to fill the
> array in gather_merge_readnext(), so does the above code help much?
>
> > Also I
> > thought
> > filling tuple array (which basically read tuple into nowait mode) and
> > reading tuple
> > (into wait-mode) are two separate task - and if its into separate
> function
> > that code
> > look more clear.
>
> To me that looks slightly confusing.
>
> > If you have any suggestion for the function name
> > (fill_tuple_array)
> > then I am open to change that.
> >
>
> form_tuple_array (form_tuple is used at many places in code, so it
> should look okay).

Ok, I rename it with next patch.

> I think you might want to consider forming array
> even for leader, although it might not be as beneficial as for
> workers, OTOH, the code will get simplified if we do that way.
>

Yes, I did that earlier - and as you guessed its not be any beneficial
so to avoided extra memory allocation for the tuple array, I am not
forming array for leader.

Today, I observed another issue in code:
>
> +gather_merge_init(GatherMergeState *gm_state)
> {
> ..
> +reread:
> + for (i = 0; i < nreaders + 1; i++)
> + {
> + if (TupIsNull(gm_state->gm_slots[i]) ||
> + gm_state->gm_slots[i]->tts_isempty)
> + {
> + if (gather_merge_readnext(gm_state, i, initialize ? false : true))
> + {
> + binaryheap_add_unordered(gm_state->gm_heap,
> + Int32GetDatum(i));
> + }
> + }
> + else
> + fill_tuple_array(gm_state, i);
> + }
> + initialize = false;
> +
> + for (i = 0; i < nreaders; i++)
> + if (TupIsNull(gm_state->gm_slots[i]) || gm_state->gm_slots[i]->tts_
> isempty)
> + goto reread;
> ..
> }
>
> This code can cause infinite loop. Consider a case where one of the
> worker doesn't get any tuple because by the time it starts all the
> tuples are consumed by all other workers. The above code will keep on
> looping to fetch the tuple from that worker whereas that worker can't
> return any tuple. I think you can fix it by checking if the
> particular queue has been exhausted.
>
>
Oh yes. I will work on the fix and soon submit the next set of patch.

> >> >
> >> > Open Issue:
> >> >
> >> > - Commit af33039317ddc4a0e38a02e2255c2bf453115fd2 fixed the leak into
> >> > tqueue.c by
> >> > calling gather_readnext() into per-tuple context. Now for gather merge
> >> > that
> >> > is
> >> > not possible, as we storing the tuple into Tuple array and we want
> tuple
> >> > to
> >> > be
> >> > free only its get pass through the merge sort algorithm. One idea is,
> we
> >> > can
> >> > also call gm_readnext_tuple() under per-tuple context (which will fix
> >> > the
> >> > leak
> >> > into tqueue.c) and then store the copy of tuple into tuple array.
> >> >
> >>
> >> Won't extra copy per tuple impact performance? Is the fix in
> >> mentioned commit was for record or composite types, if so, does
> >> GatherMerge support such types and if it support, does it provide any
> >> benefit over Gather?
> >>
> >
> > I don't think was specificially for the record or composite types - but I
> > might be
> > wrong. As per my understanding commit fix leak into tqueue.c.
> >
>
> Hmm, in tqueue.c, I think the memory leak was remapping logic, refer
> mail [1] of Tom (Just to add insult to injury, the backend's memory
> consumption bloats to something over 5.5G during that last query).
>
> > Fix was to add
> > standard to call TupleQueueReaderNext() with shorter memory context - so
> > that
> > tqueue.c doesn't leak the memory.
> >
> > I have idea to fix this by calling the TupleQueueReaderNext() with
> per-tuple
> > context,
> > and then copy the tuple and store it to the tuple array and later with
> the
> > next run of
> > ExecStoreTuple() will free the earlier tuple. I will work on that.
> >
>
> Okay, if you think that is viable, then you can do it, but do check
> the performance impact of same, because extra copy per fetched tuple
> can impact performance.
>
>
Sure, I will check the performance impact for the same.

>
> [1] - https://www.postgresql.org/message-id/32763.1469821037%
> 40sss.pgh.pa.us
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabien COELHO 2016-10-24 06:36:55 Re: [COMMITTERS] pgsql: Remove extra comma at end of enum list
Previous Message Michael Paquier 2016-10-24 05:46:43 Re: pg_basebackup stream xlog to tar