Re: Gather Merge

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gather Merge
Date: 2016-11-11 12:58:02
Message-ID: CAGPqQf2_tvk4xyawufmoq1hkCO5trEu8CVU0m0+UQp_N30=A9g@mail.gmail.com
Lists: pgsql-hackers

Oops, I forgot to attach the latest patch in the earlier mail.

On Fri, Nov 11, 2016 at 6:26 PM, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
wrote:

>
>
> On Fri, Nov 4, 2016 at 8:30 AM, Thomas Munro <
> thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
>> On Thu, Oct 27, 2016 at 10:50 PM, Rushabh Lathia
>> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
>> > Please find attached latest patch which fix the review point as well as
>> > additional clean-up.
>>
>> I've signed up to review this patch and I'm planning to do some
>> testing. Here's some initial feedback after a quick read-through:
>>
>>
> Thanks Thomas.
>
>
>> + if (gather_merge_readnext(gm_state, i, initialize ? false : true))
>>
>> Clunky ternary operator... how about "!initialize".
>>
>>
> Fixed.
>
>
>> +/*
>> + * Function clear out a slot in the tuple table for each gather merge
>> + * slots and returns the clear clear slot.
>> + */
>>
>> Maybe better like this? "_Clear_ out a slot in the tuple table for
>> each gather merge _slot_ and _return_ the _cleared_ slot."
>>
>>
> Fixed.
>
>
>> + /* Free tuple array as we no more need it */
>>
>> "... as we don't need it any more"
>>
>>
> Fixed.
>
>
>> +/*
>> + * Read the next tuple for gather merge.
>> + *
>> + * Function fetch the sorted tuple out of the heap.
>> + */
>>
>> "_Fetch_ the sorted tuple out of the heap."
>>
>>
> Fixed.
>
>
>> + * Otherwise, pull the next tuple from whichever participate we
>> + * returned from last time, and reinsert the index into the heap,
>> + * because it might now compare differently against the existing
>>
>> s/participate/participant/
>>
>>
> Fixed.
>
>
>> + * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
>> + * Portions Copyright (c) 1994, Regents of the University of California
>>
>> Shouldn't this say just "(c) 2016, PostgreSQL Global Development
>> Group"?
>
>
> Fixed.
>
>
>> Are we supposed to be blaming the University of California
>> for new files?
>>
>>
> Not quite sure about this, so keeping this as it is.
>
>
>> +#include "executor/tqueue.h"
>> +#include "miscadmin.h"
>> +#include "utils/memutils.h"
>> +#include "utils/rel.h"
>> +#include "lib/binaryheap.h"
>>
>> Not correctly sorted.
>>
>>
> Copied from nodeGather.c, but fixed here.
>
>
>> + /*
>> + * store the tuple descriptor into gather merge state, so we can use it
>> + * later while initilizing the gather merge slots.
>> + */
>>
>> s/initilizing/initializing/
>>
>>
> Fixed.
>
>
>> +/* ----------------------------------------------------------------
>> + * ExecEndGatherMerge
>> + *
>> + * frees any storage allocated through C routines.
>> + * ----------------------------------------------------------------
>>
>> The convention in Postgres code seems to be to use a form like "Free
>> any storage ..." in function documentation. Not sure if that's an
>> imperative, an infinitive, or if the word "we" is omitted since
>> English is so fuzzy like that, but it's inconsistent with other
>> documentation to use "frees" here. Oh, I see that exact wording is in
>> several other files. I guess I'll just leave this as a complaint
>> about all those files then :-)
>>
>>
> Sure.
>
>
>> + * Pull atleast single tuple from each worker + leader and set up the
>> heap.
>>
>> s/atleast single/at least a single/
>>
>>
> Fixed.
>
>
>> + * Read the tuple for given reader into nowait mode, and form the tuple
>> array.
>>
>> s/ into / in /
>>
>>
> Fixed.
>
>
>> + * Function attempt to read tuple for the given reader and store it into
>> reader
>>
>> s/Function attempt /Attempt /
>>
>>
> Fixed.
>
>
>> + * Function returns true if found tuple for the reader, otherwise returns
>>
>> s/Function returns /Return /
>>
>>
> Fixed.
>
>
>> + * First try to read tuple for each worker (including leader) into nowait
>> + * mode, so that we initialize read from each worker as well as leader.
>>
>> I wonder if it would be good to standardise on the terminology we use
>> when we mean workers AND the leader. In my Parallel Shared Hash work,
>> I've been saying 'participants' if I mean = workers + leader. What do
>> you think?
>>
>>
> I am not quite sure about "participants". In my opinion, when we
> explicitly say workers + leader it is clearer. I am open to changing
> it if a committer thinks otherwise.
>
>
>
>> + * After this if all active worker unable to produce the tuple, then
>> + * re-read and this time read the tuple into wait mode. For the worker,
>> + * which was able to produced single tuple in the earlier loop and still
>> + * active, just try fill the tuple array if more tuples available.
>> + */
>>
>> How about this? "After this, if all active workers are unable to
>> produce a tuple, then re-read and this time use wait mode. For workers
>> that were able to produce a tuple in the earlier loop and are still
>> active, just try to fill the tuple array if more tuples are
>> available."
>>
>>
> Fixed.
>
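> To illustrate the two-phase read described above, here is a rough
> sketch (simplified, not the patch's exact code; the pending[] array,
> nreaders, and the nowait flag from the reworked API discussed later
> in this mail are illustrative):
>
>     /* Phase 1: nowait reads, to prime every reader without blocking. */
>     for (i = 0; i < nreaders; i++)
>         pending[i] = !gather_merge_readnext(gm_state, i, true);
>
>     /* Phase 2: wait on the stragglers; top up buffers for the rest. */
>     for (i = 0; i < nreaders; i++)
>     {
>         if (pending[i])
>             (void) gather_merge_readnext(gm_state, i, false);
>         else
>             form_tuple_array(gm_state, i);
>     }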
>
>> + * The heap is never spilled to disk, since we assume N is not very
>> large. So
>> + * this is much simple then cost_sort.
>>
>> s/much simple then/much simpler than/
>>
>>
> Fixed.
>
>
>> + /*
>> + * Avoid log(0)...
>> + */
>> + N = (path->num_workers < 2) ? 2.0 : (double) path->num_workers;
>> + logN = LOG2(N);
>> ...
>> + /* Per-tuple heap maintenance cost */
>> + run_cost += path->path.rows * comparison_cost * 2.0 * logN;
>>
>> Why multiply by two? The comment above this code says "about log2(N)
>> comparisons to delete the top heap entry and another log2(N)
>> comparisons to insert its successor". In fact gather_merge_getnext
>> calls binaryheap_replace_first, which replaces the top element without
>> any comparisons at all and then performs a sift-down in log2(N)
>> comparisons to find its new position. There is no per-tuple "delete"
>> involved. We "replace" the top element with the value it already had,
>> just to trigger the sift-down, because we know that our comparator
>> function might have a new opinion of the sort order of this element.
>> Very clever! The comment and the 2.0 factor in cost_gather_merge seem
>> to be wrong though -- or am I misreading the code?
>>
>> See cost_merge_append.
>
>
>> Also, shouldn't we add 1 to N to account for the leader? Suppose
>> there are 2 workers. There are 3 elements in the binary heap. The
>> element to be sifted down must be compared against either 1 or 2
>> others to reorganise the heap. Surely in that case we should estimate
>> log2(3) = ~1.58 comparisons, not log2(2) = 1 comparison.
>>
>
> Yes, good catch. For Gather Merge the leader always participates, so
> we should use num_workers + 1.
>
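> Following cost_merge_append, the per-tuple charge then becomes a
> single log2(N) with N counting the leader. A minimal standalone
> sketch of the corrected arithmetic (the constants are illustrative,
> not the planner's real values):
>
>     #include <math.h>
>     #include <stdio.h>
>
>     int
>     main(void)
>     {
>         int     num_workers = 4;          /* Workers Planned */
>         double  rows = 2499996.0;         /* per-participant row estimate */
>         double  comparison_cost = 0.005;  /* hypothetical */
>
>         /* The leader always participates, so the heap holds
>          * num_workers + 1 entries. */
>         double  N = (double) (num_workers + 1);
>         double  logN = log2(N);
>
>         /* One sift-down of ~log2(N) comparisons per returned tuple;
>          * binaryheap_replace_first() involves no separate "delete". */
>         double  run_cost = rows * comparison_cost * logN;
>
>         printf("heap maintenance cost: %g\n", run_cost);
>         return 0;
>     }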
>
>>
>> I suspect that the leader's contribution will be equivalent to a whole
>> worker if the plan involves a sort: as soon as the leader pulls a
>> tuple in gather_merge_init, the sort node will pull all the tuples it
>> can in a tight loop. It's unfortunate that cost_seqscan has to
>> estimate what the leader's contribution will be without knowing
>> whether it has a "greedy" high-startup-cost consumer like a sort or
>> hash node where the leader will contribute a whole backend's full
>> attention as soon as it executes the plan, or a lazy consumer where
>> the leader will probably not contribute much if there are enough
>> workers to keep it distracted. In the case of a Gather Merge -> Sort
>> -> Parallel Seq Scan plan, I think we will overestimate the number of
>> rows (per participant), because cost_seqscan will guess that the
>> leader is spending 30% of its time per worker servicing the workers,
>> when in fact it will be sucking tuples into a sort node just as fast
>> as anyone else. But I don't see what this patch can do about that...
>>
>>
> Exactly. There is a very thin line when it comes to calculating the
> cost. In general, while calculating the cost for GM, I just tried to
> stay close to Gather + MergeAppend.
>
>
>> + * When force is true, function reads the tuple into wait mode. For
>> gather
>> + * merge we need to fill the slot from which we returned the earlier
>> tuple, so
>> + * this require tuple to be read into wait mode. During initialization
>> phase,
>> + * once we try to read the tuple into no-wait mode as we want to
>> initialize all
>> + * the readers. Refer gather_merge_init() for more details.
>> + *
>> + * Function returns true if found tuple for the reader, otherwise returns
>> + * false.
>> + */
>> +static bool
>> +gather_merge_readnext(GatherMergeState *gm_state, int reader, bool
>> force)
>>
>> s/into wait mode/in wait mode/
>>
>> This appears throughout the comments; not sure if I can explain this
>> well but "in wait mode" describes a state of being which is wanted
>> here, "into wait mode" describes some kind of change or movement or
>> insertion.
>>
>> Perhaps it would be better to say "reads the tuple _queue_ in wait
>> mode", just to make clearer that this is talking about the wait/nowait
>> feature of tuple queues, and perhaps also note that the leader always
>> waits since it executes the plan.
>>
>>
> Fixed. I just chose to s/into wait mode/in wait mode/.
>
> Maybe we should use "bool nowait" here anyway, mirroring the TupleQueue
>> interface? Why introduce another terminology for the same thing with
>> inverted sense?
>>
>>
> Agree with you. I changed the gm_readnext_tuple() and
> gather_merge_readnext() APIs accordingly.
>
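> The reworked declarations look roughly like this (a sketch from my
> current draft; the done out-parameter on gm_readnext_tuple() reports
> an exhausted queue, and the exact shape may still change):
>
>     static bool gather_merge_readnext(GatherMergeState *gm_state,
>                                       int reader, bool nowait);
>     static HeapTuple gm_readnext_tuple(GatherMergeState *gm_state,
>                                        int reader, bool nowait,
>                                        bool *done);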
>
>> +/*
>> + * Read the tuple for given reader into nowait mode, and form the tuple
>> array.
>> + */
>> +static void
>> +form_tuple_array(GatherMergeState *gm_state, int reader)
>>
>> This function is strangely named. How about try_to_fill_tuple_buffer
>> or something?
>>
>> + GMReaderTuple *gm_tuple = &gm_state->gm_tuple[reader];
>>
>> I wonder if the purpose of gm_tuple would be clearer if it were
>> called gm_tuple_buffers, plural because it holds one buffer per
>> reader. Then the variable on the left hand side could be called
>> tuple_buffer (singular), because it's the buffer of tuples for one
>> single reader.
>>
>>
> Yes, you are right. I renamed the variable as well as the structure.
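>
> The renamed per-reader buffer looks roughly like this (field names
> are from my current draft and may still change):
>
>     typedef struct GMReaderTupleBuffer
>     {
>         HeapTuple  *tuple;        /* array of buffered tuples */
>         int         readCounter;  /* next buffered tuple to return */
>         int         nTuples;      /* number of tuples currently buffered */
>         bool        done;         /* reader exhausted? */
>     } GMReaderTupleBuffer;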
>
> PFA the latest patch, which addresses the review comments as
> well as a few other clean-ups.
>
> Apart from this, my colleague Rafia Sabih reported a regression with
> GM: if we set work_mem high enough to accommodate the sort
> operation, the GM path gets selected even though the plain Sort
> performs much better.
>
> Example:
>
> create table t (i int);
> insert into t values(generate_series(1,10000000));
> set work_mem =1024000;
> explain analyze select * from t order by i;
> set enable_gathermerge =off;
> explain analyze select * from t order by i;
>
> postgres=# explain analyze select * from t order by i;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
> Gather Merge (cost=335916.26..648415.76 rows=2499996 width=4) (actual time=2234.145..7628.555 rows=10000000 loops=1)
> Workers Planned: 4
> Workers Launched: 4
> -> Sort (cost=334916.22..341166.21 rows=2499996 width=4) (actual time=2226.609..2611.041 rows=2000000 loops=5)
> Sort Key: i
> Sort Method: quicksort Memory: 147669kB
> -> Parallel Seq Scan on t (cost=0.00..69247.96 rows=2499996 width=4) (actual time=0.034..323.129 rows=2000000 loops=5)
> Planning time: 0.061 ms
> Execution time: 8143.809 ms
> (9 rows)
>
> postgres=# set enable_gathermerge = off;
> SET
> postgres=# explain analyze select * from t order by i;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
> Sort (cost=1306920.83..1331920.79 rows=9999985 width=4) (actual time=3521.143..4854.148 rows=10000000 loops=1)
> Sort Key: i
> Sort Method: quicksort Memory: 854075kB
> -> Seq Scan on t (cost=0.00..144247.85 rows=9999985 width=4) (actual time=0.113..1340.758 rows=10000000 loops=1)
> Planning time: 0.100 ms
> Execution time: 5535.560 ms
> (6 rows)
>
>
> Looking at the plan, I realized that this is happening because of
> wrong costing for Gather Merge. In the plan we can see that the row
> count estimated by Gather Merge is wrong. This is because the
> earlier patch had GM assuming rows = subpath->rows, which is not
> right, as the subpath is a partial path, so we need to multiply it
> by the number of workers. The attached patch also fixes this issue;
> a sketch of the change follows below. I also ran the TPC-H benchmark
> with the patch, and the results are the same as earlier.
>
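> A sketch of that change (illustrative of the idea, not necessarily
> the patch's exact code):
>
>     /* subpath->rows is a per-worker estimate, since the subpath is
>      * partial; scale it up for the Gather Merge node's total. */
>     pathnode->path.rows = subpath->rows * pathnode->num_workers;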
>
> Thanks,
> Rushabh Lathia
> www.EnterpriseDB.com
>

--
Rushabh Lathia

Attachment Content-Type Size
gather_merge_v4.patch application/x-download 50.6 KB
