Re: parallel mode and parallel contexts

From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-05-07 07:09:54
Message-ID: 20150507070954.GA3557937@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, May 05, 2015 at 01:05:38PM -0400, Robert Haas wrote:
> On Sat, May 2, 2015 at 12:35 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
> > model it with queue-nonempty and queue-nonfull events, one pair per queue.

That comment of mine was junk; it suggested a data structure containing the
same worker list it purported to remove. Oops.

> Each individual queue has only a single reader and a single writer.
> In your example, there would be three queues, not just one. Of
> course, one could design a new shared memory data structure
> representing a collection of related queues, but I still don't see
> exactly how we get around the problem of that requiring
> O(MaxBackends^2) storage space.

If each backend waits on a uniformly-distributed 50% of other backends,
tracking that wait graph indeed requires O(MaxBackends^2) space. Backends
completing useful work in the field will have far sparser wait graphs, and
that should inform the choice of data structure:

On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
> When I first started thinking about how to fix this, I said, well,
> that's no big deal, we can just advertise the whole list of processes
> that we're waiting for in shared memory, rather than just the one.

That boiled down to representing the wait graph with an adjacency list, which
sounds like an efficient choice.
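
To make that concrete, here is a rough sketch of how per-backend adjacency
lists might sit in static shared memory. Everything here is invented for
illustration (the struct, the cap, waitEdgeArray); the point is just that
space stays linear in MaxBackends when the graph is sparse:

#include "postgres.h"
#include "storage/backendid.h"
#include "storage/spin.h"

/* Invented cap; field wait graphs should stay well under it. */
#define MAX_WAIT_TARGETS 8

typedef struct BackendWaitEdges
{
    slock_t     mutex;      /* protects nwaits and waitsFor[] */
    int         nwaits;     /* number of valid entries */
    BackendId   waitsFor[MAX_WAIT_TARGETS]; /* adjacency list: who we wait on */
} BackendWaitEdges;

/*
 * One entry per backend, allocated at shmem init.  Total space is
 * O(MaxBackends * MAX_WAIT_TARGETS), not O(MaxBackends^2).
 */
extern BackendWaitEdges *waitEdgeArray;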

> > M
> > subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M
> > publishes to queue0-nonfull, and the workers subscribe. An edge forms in the
> > deadlock detector's graph when all of an event's subscribers wait on it. (One
> > can approximate that in userspace with a pair of advisory locks.)
>
> An edge between which two processes?

I hadn't de-fuzzed my thinking that far yet. If you still need the answer
after this message, let me know, and I'll work on that. As a guess, I think
it's three edges M-W1, M-W2 and M-W3.

> >> After thinking about it a bit more, I realized that even if we settle
> >> on some solution to that problem, there's another issues: the
> >> wait-edges created by this system don't really have the same semantics
> >> as regular lock waits. Suppose A is waiting on a lock held by B and B
> >> is waiting for a lock held by A; that's a deadlock. But if A is
> >> waiting for B to write to a tuple queue and B is waiting for A to read
> >> from a tuple queue, that's not a deadlock if the queues in question
> >> are the same.
> >
> > I do see a deadlock. B wants to block until A reads from the queue, so it
> > advertises that and sleeps. Waking up deadlock_timeout later, it notices that
> > A is waiting for B to write something. B will not spontaneously suspend
> > waiting to write to the queue, nor will A suspend waiting to read from the
> > queue. Thus, the deadlock is valid. This assumes the deadlock detector
> > reasons from an authoritative picture of pending waits and that we reliably
> > wake up a process when the condition it sought has arrived. We have that for
> > heavyweight lock acquisitions. The assumption may be incompatible with
> > Andres's hope, quoted above, of avoiding the full lock acquisition procedure.
>
> I don't understand. What's typically going to happen here is that the
> queue will initially be empty, and A will block. Suppose B has a
> large message to write to the queue, let's say 100x the queue size.
> It will write the first 1% of the message into the queue, set A's
> latch, and go to sleep. A will now wake up, drain the queue, set B's
> latch, and go to sleep. B will then wake up, write the next chunk of
> the message, set A's latch again, and go to sleep. They'll go back
> and forth like this until the entire message has been transmitted.
> There's no deadlock here: everything is working as designed. But
> there may be instants where both A and B are waiting, because (for
> example) A may set B's latch and finish going to sleep before B gets
> around to waking up.

I see what you had in mind. The problem showed up in the last sentence;
namely, the hand-off from process to process is not atomic like it is for
heavyweight locks. That's exactly what I was (not too clearly) getting at
when I wrote, "This assumes ..." above. For the sake of illustration, suppose
you replace your queue in this algorithm with a heavyweight lock and a small
buffer. Initially, B holds the lock and A waits for the lock. The following
event sequence repeats until the transfer is done:

B fills the buffer
B releases the lock, granting it to A during LockRelease()
B starts waiting for the lock
A empties the buffer
A releases the lock, granting it to B during LockRelease()
A starts waiting for the lock

The deadlock detector will rightly never report a deadlock. To have the same
safety in your example, it's not enough for one process to set the latch of
another process. The process must update something in static shared memory to
indicate that the waited-for condition (queue-nonempty for A, queue-nonfull
for B) is now satisfied. You need such a unit of state anyway, don't you?
How else would the deadlock detector know that A waits for B to write to the
tuple queue? (When the deadlock detector notices this deadlock, it may be
running in a process not participating in the parallel group. It can't rely
on anything in backend or dynamic shared memory.) With that, the deadlock
detector can distinguish a process waiting for a yet-unsatisfied condition
from a process that will soon wake to exploit a recently-satisfied condition.
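
For illustration only, here is what that unit of state might look like,
using the atomics and latch primitives we already have. The type and
function names are invented; only pg_atomic_write_u32(), SetLatch() and
PGPROC.procLatch are real:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/latch.h"
#include "storage/proc.h"

/*
 * One pair of flags per queue, in static shared memory, so a deadlock
 * detector running in any process can read them.
 */
typedef struct TupleQueueWaitState
{
    pg_atomic_uint32 queue_nonempty;    /* set by writer, cleared by reader */
    pg_atomic_uint32 queue_nonfull;     /* set by reader, cleared by writer */
} TupleQueueWaitState;

/* Writer side, after copying a chunk into the queue. */
static void
advertise_nonempty(TupleQueueWaitState *st, PGPROC *reader)
{
    pg_atomic_write_u32(&st->queue_nonempty, 1);
    SetLatch(&reader->procLatch);   /* reader may already be going to sleep */
}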

> That's probably something that we could patch around, but I think it's
> missing the larger point. When you're dealing with locks, there is
> one operation that causes blocking (acquiring a lock) and another
> operation that unblocks other processes (releasing a lock). With
> message queues, there are still two operations (reading and writing),
> but either of them can either cause you to block yourself, or to
> unblock another process. To put that another way, locks enforce
> mutual exclusion; message queues force mutual *inclusion*. Locks
> cause blocking when two processes try to operate on the same object at
> the same time; message queues cause blocking *unless* two processes
> operate on the same object at the same time. That difference seems to
> me to be quite fundamental.

Interesting thought.

> That problem is actually not too hard to avoid if you're OK with
> extremely frequent lock manager manipulations. It's not a deadlock if
> a worker is waiting for a lock (say, a relation extension lock) held
> by the leader, because the leader might release that lock quickly.
> But if the leader gets to the top of the funnel node's main loop and
> one of its workers is still waiting for the lock at that point, that
> is almost certainly a deadlock. Locks might get taken and released
> during a single pass through that loop, but any lock held across
> iterations is presumably going to be held until end-of-transaction, so
> we're hosed. But checking for deadlock at the top of every main loop
> iteration is far too expensive to contemplate. Even doing some much
> lighter-weight manipulation of the lock manager data structures at the
> top of every loop iteration is probably going to recreate the same
> kind of lock manager contention problems that I tried to solve with
> the fast-path locking stuff in 9.2.

Agreed.

> To put all of my cards on the table, I've never really believed in
> this approach. The reason we have a deadlock detector in the first
> place is because we can't prevent users from making fundamentally
> incompatible locking requests, like locking two tables in incompatible
> orderings in two different sessions. But we *don't* have deadlock
> detection for lwlocks because we (rightly) view it as our job to write
> the code in such a way that deadlocks don't happen in the first place.
> So we take locks in a predictable order, even in cases (like updating
> a tuple) where that involves some fancy gymnastics. We could have
> update always lock the old-tuple buffer before the new-tuple buffer
> and make it the deadlock detector's job to sort that out, but that's
> pretty obviously inferior. We can expect users to tolerate a deadlock
> when it is their own actions that have precipitated the deadlock, but
> it's a whole different thing to ask them to accept that what from the
> user's point of view is an indivisible action (like an UPDATE, or a
> parallel query, or perhaps an UPSERT) occasionally deadlocks for some
> reason that they can neither understand nor prevent. Tuple queues are
> an implementation detail. They shouldn't deadlock for the same reason
> that lwlock acquisition shouldn't deadlock, and this whole project
> should be entirely unnecessary.
>
> The way we backed into worrying about exposing tuple queues to the
> deadlock detector is that, if you let one backend in a parallel group
> get and keep a lock on some resource and then another backend in the
> same group tries to lock that resource and can't get the lock, you
> will eventually get an undetected deadlock. Andres's view is that we
> should fix the deadlock detector to detect and report such cases, but
> I think that's wrong. If a process in a parallel group can take and
> retain until end of transaction a lock on some resource without which
> other processes in the same parallel group cannot do useful work,
> that's a BUG. We shouldn't need the deadlock detector to report that
> problem for the same reason we shouldn't need the deadlock detector to
> report lwlock-based deadlocks: if they are ever happening, you need to
> fix your code, not the deadlock detector.

The number of held LWLocks is always zero when entering user-defined code and
again when exiting user-defined code. Therefore, the possible LWLock wait
graphs are known at compile time, and we could prove that none contain a
deadlock. Therefore, we scarcely miss having an LWLock deadlock detector.
That does not map to the tuple queue behavior at hand, because we hope to
allow the queue's writer to run user-defined code while the queue's reader is
waiting. My term "user-defined code" does come with some hand-waving. A
superuser can cause an undetected deadlock via a C-language hook. We would
not call that a PostgreSQL bug, though the hook is literally user-defined.
Let's keep the deadlock detector able to identify every deadlock a
non-superuser can cause. That includes deadlocks arising from heavyweight
lock acquisition in parallel-safe functions.
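
To spell out the invariant I mean (LWLockHeldCount() is invented;
lwlock.c tracks the same thing internally in its static
num_held_lwlocks counter):

#include "postgres.h"
#include "fmgr.h"
#include "storage/lwlock.h"

/* Illustration only; not a proposal to add such a wrapper. */
static Datum
call_user_defined_code(FunctionCallInfo fcinfo)
{
    Datum       result;

    Assert(LWLockHeldCount() == 0); /* no LWLocks held on entry ... */
    result = FunctionCallInvoke(fcinfo);
    Assert(LWLockHeldCount() == 0); /* ... nor on exit */
    return result;
}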

> I think where this project went off the rails is when we made the
> decision as to how to fix the problems in the original group locking
> approach. In the original concept, locks between processes in the
> same parallel group just didn't ever conflict; two
> otherwise-conflicting locks held by different backends within the same
> group were regarded as those processes sharing that lock. Andres and
> possibly others pointed out that for stuff like relation locks, that's

I think you mean "relation extension locks".

> probably not the right behavior. I agree with that. It was also
> suggested that this would be less scary if we limited ourselves to
> sharing the locks held at the beginning of parallelism. That had the
> further advantage of making things like relation extension locks,
> which won't be held at the start of parallelism, unshared, while
> relation locks would, in most cases, be shared. That felt pretty
> good, so I did it, but I now think that was a mistake, because it
> creates edge cases where parallel groups can self-deadlock. If we
> instead regard relation locks between parallel group members as NEVER
> conflicting, rather than conflicting only when both locks were
> acquired after the start of parallelism, those edge cases go away.

Yes. Then you're back to something like the LWLock scenario, where an
undetected deadlock implies a PostgreSQL bug. That's a good place to be.

> The only downside is that if a worker A manages to do something to a
> relation R that makes it unsafe for worker B to access, and worker B
> then gets the lock anyway, we get no-holds-barred insanity instead of
> a deadlock. But I am unconvinced that downside amounts to much,
> because I believe the only way that A can make it unsafe for B to
> access the relation is by doing something that CheckTableNotInUse()
> will already prohibit in parallel mode. If it turns out there is some
> oversight there, I don't think it'll be hard to plug.

This is the bottom line, and I agree. I wrote more about that here:
http://www.postgresql.org/message-id/20141226040546.GC1971688@tornado.leadboat.com

I admit that it's alarming to have different conflict semantics by locktag.
The tension originates, I think, from e.g. LOCKTAG_RELATION_EXTEND serving two
distinct purposes simultaneously. As I wrote in the mail just cited, before
altering a table in a manner that threatens concurrent usage, one must ensure
that (a) other transactions and (b) other parts of my own transaction aren't
in the way. Customarily, a heavyweight lock rules out (a), and
CheckTableNotInUse() rules out (b). Relation extension does not have or need
a CheckTableNotInUse() call or similar, because it doesn't call arbitrary code
that might reenter the relation extension process. Under parallelism, though,
it will be possible for multiple processes of a given transaction to attempt
relation extension simultaneously. So we need a (b) test. It just so happens
that allowing LOCKTAG_RELATION_EXTEND to conflict in a parallel group causes
that lock acquisition to suffice for both purposes (a) and (b). That's neat
but arguably impure. Every cure I've pondered has been worse than the
disease, though. It will be okay.
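
Concretely, with LOCKTAG_RELATION_EXTEND conflicting even among members
of one parallel group, the existing lmgr calls cover both purposes at
once. A sketch, with the actual block allocation elided:

#include "postgres.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/*
 * This single acquisition rules out both (a) other transactions and
 * (b) other processes of our own transaction; no CheckTableNotInUse()
 * analogue is needed.
 */
static void
extend_relation_safely(Relation rel)
{
    LockRelationForExtension(rel, ExclusiveLock);
    /* ... allocate and write the new block (smgrextend() etc.) ... */
    UnlockRelationForExtension(rel, ExclusiveLock);
}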

> In contrast,
> exposing this tuple queue wait information to the deadlock detector is
> looking quite complex.

Yep.

Thanks,
nm
