Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2017-01-31 04:46:14
Message-ID: CAEepm=3153WoNfXjEokVD-5XuxCe5GE5epp57ttBrYvJdasdMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 4, 2017 at 12:53 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> Attached is V7 of the patch.

I am doing some testing. First, some superficial things from first pass:

Still applies with some offsets and one easy-to-fix rejected hunk in
nbtree.c (removing some #include directives and a struct definition).

+/* Sort parallel code from state for sort__start probes */
+#define PARALLEL_SORT(state) ((state)->shared == NULL ? 0 : \
+ (state)->workerNum >= 0 : 1 : 2)

Typo: ':' instead of '?', --enable-dtrace build fails.

+ the entire utlity command, regardless of the number of

Typo: s/utlity/utility/

+ /* Perform sorting of spool, and possibly a spool2 */
+ sortmem = Max(maintenance_work_mem / btshared->scantuplesortstates, 64);

Just an observation: if you ask for a large number of workers, but
only one can be launched, it will be constrained to a small fraction
of maintenance_work_mem, but use only one worker. That's probably OK,
and I don't see how to do anything about it unless you are prepared to
make workers wait for an initial message from the leader to inform
them how many were launched.

Should this 64KB minimum be mentioned in the documentation?

+ if (!btspool->isunique)
+ {
+ shm_toc_estimate_keys(&pcxt->estimator, 2);
+ }

Project style: people always tell me to drop the curlies in cases like
that. There are a few more examples in the patch.

+ /* Wait for workers */
+ ConditionVariableSleep(&shared->workersFinishedCv,
+ WAIT_EVENT_PARALLEL_FINISH);

I don't think we should reuse WAIT_EVENT_PARALLEL_FINISH in
tuplesort_leader_wait and worker_wait. That belongs to
WaitForParallelWorkersToFinish, so someone who see that in
pg_stat_activity won't know which it is.

IIUC worker_wait() is only being used to keep the worker around so its
files aren't deleted. Once buffile cleanup is changed to be
ref-counted (in an on_dsm_detach hook?) then workers might as well
exit sooner, freeing up a worker slot... do I have that right?

Incidentally, barrier.c could probably be used for this
synchronisation instead of these functions. I think
_bt_begin_parallel would call BarrierInit(&shared->barrier,
scantuplesortstates) and then after LaunchParallelWorkers() it'd call
a new interface BarrierDetachN(&shared->barrier, scantuplesortstates -
pcxt->nworkers_launched) to forget about workers that failed to
launch. Then you could use BarrierWait where the leader waits for the
workers to finish, and BarrierDetach where the workers are finished
and want to exit.

+ /* Prepare state to create unified tapeset */
+ leaderTapes = palloc(sizeof(TapeShare) * state->maxTapes);

Missing cast (TapeShare *) here? Project style judging by code I've
seen, and avoids gratuitously C++ incompatibility.

+_bt_parallel_shared_estimate(Snapshot snapshot)
...
+tuplesort_estimate_shared(int nWorkers)

Inconsistent naming?

More soon.

--
Thomas Munro
http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-31 04:48:39 Re: An isolation test for SERIALIZABLE READ ONLY DEFERRABLE
Previous Message Michael Paquier 2017-01-31 04:45:36 Re: Renaming of pg_xlog and pg_clog