Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-12-08 00:57:26
Message-ID: CAH2-WzkYZLJ+_DnUm=x+eu5GBCY-NAxBvEoh6i6iR1aj_HmqmA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 7, 2017 at 12:25 AM, Rushabh Lathia
<rushabh(dot)lathia(at)gmail(dot)com> wrote:
> 0001-Add-parallel-B-tree-index-build-sorting_v14.patch

Cool. I'm glad that we now have a patch that applies cleanly against
master, while adding very little to buffile.c. It feels like we're
getting very close here.

>> * You didn't point out the randomAccess restriction in tuplesort.h.
>>
>
> I did, it's there in the file header comments.

I see what you wrote in tuplesort.h here:

> + * algorithm, and are typically only used for large amounts of data. Note
> + * that parallel sorts is not support for random access to the sort result.

This should say "...are not supported when random access is requested".
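
That is, the tuplesort.h header comment would read something like (my
wording of the fix):

    * algorithm, and are typically only used for large amounts of data. Note
    * that parallel sorts are not supported when random access to the sort
    * result is requested.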

> Added handling for parallel_leader_participation as well as deleted
> compile time option force_single_worker.

I still see this:

> +
> +/*
> + * A parallel sort with one worker process, and without any leader-as-worker
> + * state may be used for testing the parallel tuplesort infrastructure.
> + */
> +#ifdef NOT_USED
> +#define FORCE_SINGLE_WORKER
> +#endif

Looks like you missed this FORCE_SINGLE_WORKER hunk -- please remove it, too.

>> The parallel_leader_participation docs will also need to be updated.
>>
>
> Done.

I don't see this. There is no reference to
parallel_leader_participation in the CREATE INDEX docs, nor is there a
reference to CREATE INDEX in the parallel_leader_participation docs.

> Also performed more testing with the patch, with
> parallel_leader_participation
> ON and OFF. Found one issue, where earlier we always used to call
> _bt_leader_sort_as_worker() but now need to skip the call if
> parallel_leader_participation
> is OFF.

Hmm. I think the local variable within _bt_heapscan() should go back.
Its value should be taken from parallel_leader_participation directly,
in a single assignment. There might be some bizarre circumstances where
it is possible for the value of parallel_leader_participation to change
in flight, causing a race condition: we start with the leader as a
participant, and change our mind later within
_bt_leader_sort_as_worker(), causing the whole CREATE INDEX to hang
forever.

Even if that's impossible, it seems like an improvement in style to go
back to one local variable controlling everything.
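
Something like this is what I have in mind (just a sketch; the variable
name and exact call site are mine, not the patch's):

    /* in _bt_heapscan(): latch the GUC's value exactly once */
    bool        leaderparticipates = parallel_leader_participation;

    ...

    /* branch on the latched value only -- never re-read the GUC */
    if (leaderparticipates)
        _bt_leader_sort_as_worker(buildstate);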

Style issue here:

> + long start_block = file->numFiles * BUFFILE_SEG_SIZE;
> + int newNumFiles = file->numFiles + source->numFiles;

Shouldn't start_block conform to the surrounding camelCase style?
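
That is, something like (a hypothetical rename, following the
surrounding buffile.c naming):

    long        startBlock = file->numFiles * BUFFILE_SEG_SIZE;
    int         newNumFiles = file->numFiles + source->numFiles;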

Finally, two new thoughts on the patch, that are not responses to
anything you did in v14:

1. Thomas' barrier abstraction was added by commit 1145acc7. I think
that you should use a static barrier in tuplesort.c now, and rip out
the ConditionVariable fields in the Sharedsort struct. It's only a
slightly higher level of abstraction, one that makes a small difference
given tuplesort.c's simple requirements. However, I see no reason not
to go that way if that's the new standard, which it is. This looks like
it will be fairly easy.
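
Roughly what I have in mind (a sketch only; the field name is mine, and
the wait event info is left as zero where a real wait event belongs):

    /* in Sharedsort, replacing the ConditionVariable/mutex fields */
    Barrier     workersdonebarrier;

    /* leader, at launch time; the participant count is known up front,
     * which is what makes a static barrier sufficient here */
    BarrierInit(&shared->workersdonebarrier, nparticipants);

    /* each participant (workers, plus leader-as-worker), once its
     * sorted run has been flushed to its tape */
    BarrierArriveAndWait(&shared->workersdonebarrier, 0);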

2. Does the plan_create_index_workers() cost model need to account for
parallel_leader_participation, too, when capping workers? I think that
it does.

The relevant planner code is:

> + /*
> + * Cap workers based on available maintenance_work_mem as needed.
> + *
> + * Note that each tuplesort participant receives an even share of the
> + * total maintenance_work_mem budget. Aim to leave workers (where
> + * leader-as-worker Tuplesortstate counts as a worker) with no less than
> + * 32MB of memory. This leaves cases where maintenance_work_mem is set to
> + * 64MB immediately past the threshold of being capable of launching a
> + * single parallel worker to sort.
> + */
> + sort_mem_blocks = (maintenance_work_mem * 1024L) / BLCKSZ;
> + min_sort_mem_blocks = (32768L * 1024L) / BLCKSZ;
> + while (parallel_workers > min_parallel_workers &&
> + sort_mem_blocks / (parallel_workers + 1) < min_sort_mem_blocks)
> + parallel_workers--;

This parallel CREATE INDEX planner code snippet caps the number of
parallel workers that we plan to launch when per-worker
maintenance_work_mem availability would otherwise become too low. Each
worker tuplesort state needs at least 32MB. We clearly need to do
something here.
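
(To make the comment's 64MB example concrete: with the default 8kB
BLCKSZ, sort_mem_blocks comes out to 8192 and min_sort_mem_blocks to
4096, so a leader-as-worker plus one worker get exactly 8192 / 2 = 4096
blocks each -- right at the floor.)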

While it's always true that "leader-as-worker Tuplesortstate counts as
a worker" in v14, I think that it should only be true in the next
revision of the patch when parallel_leader_participation is actually
true (IOW, we should only add 1 to parallel_workers within the loop
invariant in that case). The reason why we need to consider
parallel_leader_participation within this plan_create_index_workers()
code is simple: During execution, _bt_leader_sort_as_worker() uses
"worker tuplesort states"/btshared->scantuplesortstates to determine
how much of a share of maintenance_work_mem each worker tuplesort
gets. Our planner code needs to take that into account, now that the
nbtsort.c parallel_leader_participation behavior isn't just some
obscure debug option. IOW, the planner code needs to be consistent
with the nbtsort.c execution code.
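
Concretely, something along these lines (an untested sketch against the
quoted hunk):

    /* only count the leader's Tuplesortstate as a participant when
     * the leader will actually participate */
    int         leader_as_worker = parallel_leader_participation ? 1 : 0;

    while (parallel_workers > min_parallel_workers &&
           sort_mem_blocks / (parallel_workers + leader_as_worker) <
           min_sort_mem_blocks)
        parallel_workers--;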

--
Peter Geoghegan
