Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-20 20:09:23
Message-ID: CAH2-WzkRQXqSjgVGzse0Vz6naM+ecX5j+s5EfyAOKjCCO1Xx2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 19, 2018 at 9:29 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> I think I can see why this patch needs that. Is it mainly for the
> work you are doing in _bt_leader_heapscan where you are waiting for
> all the workers to be finished?

Yes, though it's also needed for the leader tuplesort. It needs to be
able to discover worker runs by looking for temp files named 0 through
to $NWORKERS - 1.

The problem with seeing who shows up after a period of time, and
having the leader arbitrarily determine that to be the total number of
participants (while blocking further participants from joining) is
that I don't know *how long to wait*. This would probably work okay
for parallel CREATE INDEX, when the leader participates as a worker,
because you can check only when the leader is finished acting as a
worker. It stands to reason that that's enough time for worker
processes to at least show up, and be seen to show up. We can use the
duration of the leader's participation as a worker as a natural way to
decide how long to wait.

But what when the leader doesn't participate as a worker, for whatever
reason? Other uses for parallel tuplesort might typically have much
less leader participation as compared to parallel CREATE INDEX. In
short, ISTM that seeing who shows up is a bad strategy for parallel
tuplesort.

> I think till now we don't have any such requirement, but if it is must
> for this patch, then I don't think it is tough to do that. We need to
> write an API WaitForParallelWorkerToAttach() and then call for each
> launched worker or maybe WaitForParallelWorkersToAttach() which can
> wait for all workers to attach and report how many have successfully
> attached. It will have functionality of
> WaitForBackgroundWorkerStartup and additionally it needs to check if
> the worker is attached to the error queue. We already have similar
> API (WaitForReplicationWorkerAttach) for logical replication workers
> as well. Note that it might have a slight impact on the performance
> because with this you need to wait for the workers to startup before
> doing any actual work, but I don't think it should be noticeable for
> large operations especially for operations like parallel create index.

Actually, though it doesn't really look like it from the way things
are structured within nbtsort.c, I don't need to wait for workers to
start up (call the WaitForParallelWorkerToAttach() function you
sketched) before doing any real work within the leader. The leader can
participate as a worker, and only do this check afterwards. That will
work because the leader Tuplesortstate has yet to do any real work.
Nothing stops me from adding a new function to tuplesort, for the
leader, that lets the leader say: "New plan -- you should now expect
this many participants" (leader takes this reliable number from
eventual call to WaitForParallelWorkerToAttach()).

I admit that I had no idea that there is this issue with
nworkers_launched until very recently. But then, that field has
absolutely no comments.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-01-20 22:36:13 Re: [HACKERS] WIP: Covering + unique indexes.
Previous Message Magnus Hagander 2018-01-20 18:30:43 Re: Use of term hostaddrs for multiple hostaddr values