Re: Gather Merge

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather Merge
Date: 2016-11-04 03:00:15
Message-ID: CAEepm=3vo9P0PCyMMSunoJzEeK438hrtFdXfoqGROS7otiStpg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
<rushabh(dot)lathia(at)gmail(dot)com> wrote:
> Please find attached latest patch which fix the review point as well as
> additional clean-up.

I've signed up to review this patch and I'm planning to do some
testing. Here's some initial feedback after a quick read-through:

+ if (gather_merge_readnext(gm_state, i, initialize ? false : true))

Clunky ternary operator... how about "!initialize".

+/*
+ * Function clear out a slot in the tuple table for each gather merge
+ * slots and returns the clear clear slot.
+ */

Maybe better like this? "_Clear_ out a slot in the tuple table for
each gather merge _slot_ and _return_ the _cleared_ slot."

+ /* Free tuple array as we no more need it */

"... as we don't need it any more"

+/*
+ * Read the next tuple for gather merge.
+ *
+ * Function fetch the sorted tuple out of the heap.
+ */

"_Fetch_ the sorted tuple out of the heap."

+ * Otherwise, pull the next tuple from whichever participate we
+ * returned from last time, and reinsert the index into the heap,
+ * because it might now compare differently against the existing

s/participate/participant/

+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California

Shouldn't this say just "(c) 2016, PostgreSQL Global Development
Group"? Are we supposed to be blaming the University of California
for new files?

+#include "executor/tqueue.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "lib/binaryheap.h"

Not correctly sorted.
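I think the convention is simply alphabetical order after postgres.h,
i.e. something like:

    #include "executor/tqueue.h"
    #include "lib/binaryheap.h"
    #include "miscadmin.h"
    #include "utils/memutils.h"
    #include "utils/rel.h"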

+ /*
+ * store the tuple descriptor into gather merge state, so we can use it
+ * later while initilizing the gather merge slots.
+ */

s/initilizing/initializing/

+/* ----------------------------------------------------------------
+ * ExecEndGatherMerge
+ *
+ * frees any storage allocated through C routines.
+ * ----------------------------------------------------------------

The convention in Postgres code seems to be to use a form like "Free
any storage ..." in function documentation. Not sure if that's an
imperative, an infinitive, or if the word "we" is omitted since
English is so fuzzy like that, but it's inconsistent with other
documentation to use "frees" here. Oh, I see that exact wording is in
several other files. I guess I'll just leave this as a complaint
about all those files then :-)

+ * Pull atleast single tuple from each worker + leader and set up the heap.

s/atleast single/at least a single/

+ * Read the tuple for given reader into nowait mode, and form the tuple array.

s/ into / in /

+ * Function attempt to read tuple for the given reader and store it into reader

s/Function attempt /Attempt /

+ * Function returns true if found tuple for the reader, otherwise returns

s/Function returns /Return /

+ * First try to read tuple for each worker (including leader) into nowait
+ * mode, so that we initialize read from each worker as well as leader.

I wonder if it would be good to standardise on the terminology we use
when we mean workers AND the leader. In my Parallel Shared Hash work,
I've been saying 'participants' if I mean = workers + leader. What do
you think?

+ * After this if all active worker unable to produce the tuple, then
+ * re-read and this time read the tuple into wait mode. For the worker,
+ * which was able to produced single tuple in the earlier loop and still
+ * active, just try fill the tuple array if more tuples available.
+ */

How about this? "After this, if all active workers are unable to
produce a tuple, then re-read and this time use wait mode. For workers
that were able to produce a tuple in the earlier loop and are still
active, just try to fill the tuple array if more tuples are
available."

+ * The heap is never spilled to disk, since we assume N is not very large. So
+ * this is much simple then cost_sort.

s/much simple then/much simpler than/

+ /*
+ * Avoid log(0)...
+ */
+ N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
+ logN = LOG2(N);
...
+ /* Per-tuple heap maintenance cost */
+ run_cost += path->path.rows * comparison_cost * 2.0 * logN;

Why multiply by two? The comment above this code says "about log2(N)
comparisons to delete the top heap entry and another log2(N)
comparisons to insert its successor". In fact gather_merge_getnext
calls binaryheap_replace_first, which replaces the top element without
any comparisons at all and then performs a sift-down in log2(N)
comparisons to find its new position. There is no per-tuple "delete"
involved. We "replace" the top element with the value it already had,
just to trigger the sift-down, because we know that our comparator
function might have a new opinion of the sort order of this element.
Very clever! The comment and the 2.0 factor in cost_gather_merge seem
to be wrong though -- or am I misreading the code?

Also, shouldn't we add 1 to N to account for the leader? Suppose
there are 2 workers. There are 3 elements in the binary heap. The
element to be sifted down must be compared against either 1 or 2
others to reorganise the heap. Surely in that case we should estimate
log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.
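
If I'm right about both of those things, the cost calculation would
come out something like this (just a sketch to show what I mean, not
tested):

    /* count the leader as a participant in the merge */
    N = (double) path->num_workers + 1;
    logN = LOG2(N);
    ...
    /* Per-tuple heap maintenance cost: one sift-down per returned tuple */
    run_cost += path->path.rows * comparison_cost * logN;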

I suspect that the leader's contribution will be equivalent to a whole
worker if the plan involves a sort: as soon as the leader pulls a
tuple in gather_merge_init, the sort node will pull all the tuples it
can in a tight loop. It's unfortunate that cost_seqscan has to
estimate what the leader's contribution will be without knowing
whether it has a "greedy" high-startup-cost consumer like a sort or
hash node where the leader will contribute a whole backend's full
attention as soon as it executes the plan, or a lazy consumer where
the leader will probably not contribute much if there are enough
workers to keep it distracted. In the case of a Gather Merge -> Sort
-> Parallel Seq Scan plan, I think we will overestimate the number of
rows (per participant), because cost_seqscan will guess that the
leader is spending 30% of its time per worker servicing the workers,
when in fact it will be sucking tuples into a sort node just as fast
as anyone else. But I don't see what this patch can do about that...
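
(For reference, the 30% I'm talking about is roughly this logic in
cost_seqscan -- going from memory, so the details may be slightly off:

    /* guess how much of the leader's time goes to running the plan */
    leader_contribution = 1.0 - (0.3 * path->parallel_workers);
    if (leader_contribution > 0)
        parallel_divisor += leader_contribution;

That assumption won't hold when the consumer is a greedy node like
Sort.)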

+ * When force is true, function reads the tuple into wait mode. For gather
+ * merge we need to fill the slot from which we returned the earlier tuple, so
+ * this require tuple to be read into wait mode. During initialization phase,
+ * once we try to read the tuple into no-wait mode as we want to initialize all
+ * the readers. Refer gather_merge_init() for more details.
+ *
+ * Function returns true if found tuple for the reader, otherwise returns
+ * false.
+ */
+static bool
+gather_merge_readnext(GatherMergeState *gm_state, int reader, bool force)

s/into wait mode/in wait mode/

This appears throughout the comments; not sure if I can explain this
well but "in wait mode" describes a state of being which is wanted
here, "into wait mode" describes some kind of change or movement or
insertion.

Perhaps it would be better to say "reads the tuple _queue_ in wait
mode", just to make clearer that this is talking about the wait/nowait
feature of tuple queues, and perhaps also note that the leader always
waits since it executes the plan.

Maybe we should use "bool nowait" here anway, mirroring the TupleQueue
interface? Why introduce another terminology for the same thing with
inverted sense?
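
In other words, something like this, assuming the flag is just passed
straight through to the tuple queue reader:

    static bool
    gather_merge_readnext(GatherMergeState *gm_state, int reader, bool nowait)

Then the call quoted at the top of this email would simply become
gather_merge_readnext(gm_state, i, initialize).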

+/*
+ * Read the tuple for given reader into nowait mode, and form the tuple array.
+ */
+static void
+form_tuple_array(GatherMergeState *gm_state, int reader)

This function is strangely named. How about try_to_fill_tuple_buffer
or something?

+ GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];

I wonder if the purpose of gm_tuple would be clearer if it were
called gm_tuple_buffers. Plural because it holds one buffer per
reader. Then the variable on the left hand side could be called
tuple_buffer (singular), because it's the buffer of tuples for one
single reader.
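
That is, something like:

    GMReaderTuple *tuple_buffer = &gm_state->gm_tuple_buffers[reader];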

--
Thomas Munro
http://www.enterprisedb.com
