Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-19 01:53:57
Message-ID: CAH2-Wzk0XqQdAVE7duFx5h_vExKEELgKVf2KwEmUAAYj=zKShA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 18, 2018 at 9:22 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I just went back to the thread on "parallel.c oblivion of
> worker-startup failures" and refreshed my memory about what's going on
> over there. What's going on over there is (1) currently,
> nworkers_launched can be over-reported in a scenario that doesn't
> error out or crash and (2) I'm proposing to tighten things up so that
> this is no longer the case.

I think that we need to be able to rely on nworkers_launched not
over-reporting the number of workers actually launched. To be fair to
Amit, I haven't actually gone off and studied the problem myself, so
I'm not in a position to dismiss his point of view. It nevertheless
seems to me that being able to rely on nworkers_launched makes life
an awful lot easier.

> So, thinking about this, I think that my proposal to use dynamic
> barriers here seems like it will work regardless of who wins that
> argument. Your proposal to use static barriers and decrement the
> party size based on the number of participants which fail to start
> will work if I win that argument, but will not work if Amit wins that
> argument.

Sorry, but I've changed my mind. I don't think barriers owned by
tuplesort.c will work for us (though I think we will still need a
synchronization primitive within nbtsort.c). The design that Robert
sketched for using barriers seemed fine at first. But then I realized:
what about the case where you have 2 spools?

I now understand why Thomas thought that I'd end up using static
barriers: I now see that dynamic barriers have problems of their own
if used by tuplesort.c, even with the trick of having participants
actually participate only on the condition that they show up before
the party is over (before there are no tuples left for the worker to
consume). The idea of the leader using nworkers_launched as the
assumed-launched number of workers is pretty much baked into my
patch, because my patch makes tuplesorts composable (e.g. nbtsort.c
uses two tuplesorts when there is a unique index build/2 spools).
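
To make that concrete, here is a rough sketch of the leader-side setup
I have in mind for the unique index case (illustrative only, not the
patch verbatim -- the SortCoordinate/Sharedsort structures and field
names are approximate). The point is that nworkers_launched is the one
participant count handed to *both* sorts:

/*
 * Illustrative sketch only -- assumes the patch's SortCoordinate and
 * Sharedsort interface; names and signatures are approximate.
 */
SortCoordinate coordinate = palloc0(sizeof(SortCoordinateData));

coordinate->isWorker = false;
coordinate->nParticipants = nworkers_launched; /* single source of truth */
coordinate->sharedsort = sharedsort;

/* Main spool's tuplesort */
buildstate->spool->sortstate =
    tuplesort_begin_index_btree(heap, index, indexInfo->ii_Unique,
                                maintenance_work_mem, coordinate, false);

if (indexInfo->ii_Unique)
{
    /* spool2 (dead tuples) gets its own tuplesort, same worker count */
    SortCoordinate coordinate2 = palloc0(sizeof(SortCoordinateData));

    coordinate2->isWorker = false;
    coordinate2->nParticipants = nworkers_launched;
    coordinate2->sharedsort = sharedsort2;
    buildstate->spool2->sortstate =
        tuplesort_begin_index_btree(heap, index, false, work_mem,
                                    coordinate2, false);
}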

Do individual workers need to be prepared to back out of the main
spool's sort, but not the spool2 sort (for unique index builds), or
vice-versa? Clearly that's untenable, because they're going to need to
have both as long as they're participating in a parallel CREATE INDEX
(of a unique index) -- IndexBuildHeapScan() expects both at the same
time, but there is a race condition when launching workers with 2
spools. So does nbtsort.c need to own the barrier instead? If it does,
and if that barrier subsumes the responsibilities of tuplesort.c's
condition variables, then I don't see how that can avoid causing a
mess due to confusion about phases across tuplesorts/spools.
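
In other words, any worker that participates at all has to drive both
spools all the way through, roughly like this (again, just a sketch;
details are approximate):

/*
 * Sketch of a worker's participation -- illustrative only.  One heap
 * scan feeds both spools through the build callback, so a worker that
 * participates at all must carry both sorts through to the end.
 */
reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
                               _bt_build_callback, (void *) &buildstate);

/* Even a worker that saw no tuples still completes both sorts */
tuplesort_performsort(buildstate.spool->sortstate);
if (buildstate.spool2)
    tuplesort_performsort(buildstate.spool2->sortstate);

tuplesort_end(buildstate.spool->sortstate);
if (buildstate.spool2)
    tuplesort_end(buildstate.spool2->sortstate);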

nbtsort.c *will* actually need some synchronization primitive (I'm
thinking of a condition variable), but only because nbtsort.c happens
to want to aggregate statistics about the sort at the end (for
pg_index) -- that doesn't seem like tuplesort's problem at all. In
general, it's very natural to just call tuplesort_leader_wait(), and
have all the relevant details encapsulated within tuplesort.c. We
could make tuplesort_leader_wait() totally optional, and just use the
barrier within nbtsort.c for the wait (more on that later).
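
For reference, the encapsulated leader wait amounts to little more
than a standard condition variable loop over a shared progress count
(sketch only; Sharedsort field names are approximate):

/* Sketch of tuplesort_leader_wait() -- illustrative, not verbatim */
void
tuplesort_leader_wait(Tuplesortstate *state)
{
    Sharedsort *shared = state->shared;

    ConditionVariablePrepareToSleep(&shared->workersFinishedCv);
    for (;;)
    {
        int         nfinished;

        SpinLockAcquire(&shared->mutex);
        nfinished = shared->workersFinished;
        SpinLockRelease(&shared->mutex);

        /* nParticipants ultimately comes from nworkers_launched */
        if (nfinished == shared->nParticipants)
            break;
        ConditionVariableSleep(&shared->workersFinishedCv,
                               WAIT_EVENT_PARALLEL_FINISH);
    }
    ConditionVariableCancelSleep();
}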

> In particular, for parallel query, there is absolutely zero guarantee
> that every worker reaches every plan node. For a parallel utility
> command, things seem a little better: we can assume that workers are
> started only for one particular purpose. But even that might not be
> true in the future.

I expect workers that are reported launched to show up eventually, or
to report failure. They don't strictly have to do any work beyond
just showing up (finding no tuples, reaching tuplesort_performsort(),
then finally reaching tuplesort_end()). The spool2 issue I describe
above shows why this is. They own the state (tuplesort tuples) that
they consume, and may have 2 or more tuplesorts. If they cannot do
the bare minimum of checking in with us, then we're in big trouble,
because that's indistinguishable from their having actually sorted
some tuples that the leader will ultimately consume without our
knowing it.
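
The corresponding check-in on the worker side is equally small -- at
the end of the worker's sort, something along these lines (sketch
only; the helper name is made up for illustration):

/* Sketch of a worker reporting completion -- illustrative only */
static void
worker_report_finished(Tuplesortstate *state)
{
    Sharedsort *shared = state->shared;

    SpinLockAcquire(&shared->mutex);
    shared->workersFinished++;
    SpinLockRelease(&shared->mutex);

    /* Wake the leader, which may be blocked in tuplesort_leader_wait() */
    ConditionVariableSignal(&shared->workersFinishedCv);
}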

It wouldn't be impossible to use barriers for everything. That just
seems to be incompatible with tuplesorts being composable. Long ago,
nbtsort.c actually did the sorting, too. If that were still true, then
it would be rather a lot more like parallel hash join, I think. You
could then just have one barrier for one state machine (with one or
two spools). It seems clear that we should avoid teaching tuplesort.c
about nbtsort.c.
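
Just to illustrate the contrast: if nbtsort.c still did the sorting
itself, a single dynamic barrier driving a single state machine would
be the natural shape, much as in parallel hash join (sketch only;
helper names and the shared barrier field are made up for
illustration):

/* Hypothetical single-state-machine coordination, parallel hash style */
Barrier    *buildbarrier = &btshared->buildbarrier; /* assumed field */

BarrierAttach(buildbarrier);

/* Phase 0: every participant scans and sorts its share of the heap */
scan_and_sort_my_share();

/* Advance the phase; exactly one participant returns true and merges */
if (BarrierArriveAndWait(buildbarrier,
                         WAIT_EVENT_PARALLEL_FINISH)) /* arbitrary event */
    merge_runs_and_build_index();

BarrierDetach(buildbarrier);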

> But, hang on a minute. Why do the workers need to wait for the leader
> anyway? Can't they just exit once they're done sorting? I think the
> original reason why this ended up in the patch is that we needed the
> leader to assume ownership of the tapes to avoid having the tapes get
> blown away when the worker exits. But, IIUC, with sharedfileset.c,
> that problem no longer exists.

You're right. This is why we could make calling
tuplesort_leader_wait() optional. We only need one condition variable
in tuplesort.c, which makes me even less inclined to turn the
remaining workersFinishedCv condition variable into a barrier, since
it's not at all barrier-like. After all, workers don't care about
each other's progress, or about where the leader is. The leader needs
to wait until all known-launched participants report having finished,
which seems like a typical reason to use a condition variable. That
doesn't seem phase-like at all. As for workers, they don't have
phases ("done" isn't a phase for them because, as you say, there is
no need for them to wait until the leader says they can go with the
shared fileset stuff -- that's the leader's problem alone).

I guess the fact that tuplesort_leader_wait() could be optional means
that it could be removed, which means that we could in fact throw out
the last condition variable within tuplesort.c, and fully rely on
using a barrier for everything within nbtsort.c. However,
tuplesort_leader_wait() seems kind of like something that we should
have on general principle. And, more importantly, it would be tricky
to use a barrier even for this, because we still have that baked-in
assumption that nworkers_launched is the single source of truth about
the number of participants.

--
Peter Geoghegan
