Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-12-08 07:28:59
Message-ID: CAGPqQf3M6GDfF60NaKYULAwmqSXpM4DpkQbYQcTsbo58bor5-A@mail.gmail.com
Lists: pgsql-hackers

Thanks for the review.

On Fri, Dec 8, 2017 at 6:27 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
> <rushabh(dot)lathia(at)gmail(dot)com> wrote:
> > 0001-Add-parallel-B-tree-index-build-sorting_v14.patch
>
> Cool. I'm glad that we now have a patch that applies cleanly against
> master, while adding very little to buffile.c. It feels like we're
> getting very close here.
>
> >> * You didn't point out the randomAccess restriction in tuplesort.h.
> >>
> >
> > I did, it's there in the file header comments.
>
> I see what you wrote in tuplesort.h here:
>
> > + * algorithm, and are typically only used for large amounts of data.
> Note
> > + * that parallel sorts is not support for random access to the sort
> result.
>
> This should say "...are not supported when random access is requested".
>
>
Done.

> > Added handling for parallel_leader_participation as well as deleted
> > compile time option force_single_worker.
>
> I still see this:
>
> > +
> > +/*
> > + * A parallel sort with one worker process, and without any
> leader-as-worker
> > + * state may be used for testing the parallel tuplesort infrastructure.
> > + */
> > +#ifdef NOT_USED
> > +#define FORCE_SINGLE_WORKER
> > +#endif
>
> Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it,
> too.
>
>
Done.

> >> The parallel_leader_participation docs will also need to be updated.
> >>
> >
> > Done.
>
> I don't see this. There is no reference to
> parallel_leader_participation in the CREATE INDEX docs, nor is there a
> reference to CREATE INDEX in the parallel_leader_participation docs.
>
>
I thought parallel_leader_participation was a generic GUC that affects
all parallel operations, isn't it? On that understanding, I just updated
the parallel_leader_participation documentation in config.sgml to make
it more general.

> > Also performed more testing with the patch, with
> > parallel_leader_participation
> > ON and OFF. Found one issue, where earlier we always used to call
> > _bt_leader_sort_as_worker() but now need to skip the call if
> > parallel_leader_participation
> > is OFF.
>
> Hmm. I think the local variable within _bt_heapscan() should go back.
> Its value should be directly taken from parallel_leader_participation
> assignment, once. There might be some bizarre circumstances where it
> is possible for the value of parallel_leader_participation to change
> in flight, causing a race condition: we start with the leader as a
> participant, and change our mind later within
> _bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
> forever.
>
> Even if that's impossible, it seems like an improvement in style to go
> back to one local variable controlling everything.
>
>
Yes, it looks like an impossible situation to me as well, but it still
makes sense to use one local variable and always read the value from
that.

> Style issue here:
>
> > + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> > + int newNumFiles = file->numFiles + source->numFiles;
>
> Shouldn't start_block conform to the surrounding camelCase style?
>
>
Done.

> Finally, two new thoughts on the patch, that are not responses to
> anything you did in v14:
>
> 1. Thomas' barrier abstraction was added by commit 1145acc7. I think
> that you should use a static barrier in tuplesort.c now, and rip out
> the ConditionVariable fields in the Sharedsort struct. It's only a
> slightly higher level of abstraction for tuplesort.c, which makes only
> a small difference given the simple requirements of tuplesort.c.
> However, I see no reason to not go that way if that's the new
> standard, which it is. This looks like it will be fairly easy.
>
>
Pending; as per Thomas' explanation, it seems the barrier APIs need some
more work first.

> 2. Does the plan_create_index_workers() cost model need to account for
> parallel_leader_participation, too, when capping workers? I think that
> it does.
>
> The relevant planner code is:
>
> > + /*
> > + * Cap workers based on available maintenance_work_mem as needed.
> > + *
> > + * Note that each tuplesort participant receives an even share of the
> > + * total maintenance_work_mem budget. Aim to leave workers (where
> > + * leader-as-worker Tuplesortstate counts as a worker) with no less
> than
> > + * 32MB of memory. This leaves cases where maintenance_work_mem is
> set to
> > + * 64MB immediately past the threshold of being capable of launching
> a
> > + * single parallel worker to sort.
> > + */
> > + sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> > + min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> > + while (parallel_workers > min_parallel_workers &&
> > + sort_mem_blocks / (parallel_workers + 1) <
> min_sort_mem_blocks)
> > + parallel_workers--;
>
> This parallel CREATE INDEX planner code snippet is about the need to
> have low per-worker maintenance_work_mem availability prevent more
> parallel workers from being added to the number that we plan to
> launch. Each worker tuplesort state needs at least 32MB. We clearly
> need to do something here.
>
> While it's always true that "leader-as-worker Tuplesortstate counts as
> a worker" in v14, I think that it should only be true in the next
> revision of the patch when parallel_leader_participation is actually
> true (IOW, we should only add 1 to parallel_workers within the loop
> invariant in that case). The reason why we need to consider
> parallel_leader_participation within this plan_create_index_workers()
> code is simple: During execution, _bt_leader_sort_as_worker() uses
> "worker tuplesort states"/btshared->scantuplesortstates to determine
> how much of a share of maintenance_work_mem each worker tuplesort
> gets. Our planner code needs to take that into account, now that the
> nbtsort.c parallel_leader_participation behavior isn't just some
> obscure debug option. IOW, the planner code needs to be consistent
> with the nbtsort.c execution code.
>
>
Ah, nice catch. I now pass the local variable (leaderasworker) from
_bt_heapscan() to plan_create_index_workers() rather than reading the
value directly from parallel_leader_participation (for the same reasons
you explained earlier).

Thanks,
Rushabh Lathia
www.EnterpriseDB.com

Attachment Content-Type Size
0002-Adjust-planner-code-to-consider-parallel_leader_part.patch text/x-patch 3.2 KB
0001-Parallel-tuple-sort-implementation-for-CREATE-INDEX_v15.patch text/x-patch 150.5 KB
