Re: parallel mode and parallel contexts

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-03-19 18:13:59
Message-ID: CA+TgmoZfSXZhS6qy4Z0786D7iU_AbhBVPQFwLthpSvGieczqHg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Reading the README first, the rest later. So you can comment on my
> comments, while I actually look at the code. Parallelism, yay!

Sorry, we don't support parallelism yet. :-)

> On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
>> +Instead, we take a more pragmatic approach: we try to make as many of the
>> +operations that are safe outside of parallel mode work correctly in parallel
>> +mode as well, and we try to prohibit the rest via suitable error
>> checks.
>
> I'd say that we'd try to prohibit a bunch of common cases. Among them
> the ones that are triggerable from SQL. We don't really try to prohibit
> many kinds of traps, as you describe.

I revised the text along these lines.

>> + - The values of all GUCs. Accordingly, permanent changes to the value of
>> + any GUC are forbidden while in parallel mode; but temporary changes,
>> + such as entering a function with non-NULL proconfig, are potentially OK.
>
> "potentially OK" sounds odd to me. Which danger are you seing that isn't
> relevan tfor norm

Removed "potentially".

>> + - The combo CID mappings. This is needed to ensure consistent answers to
>> + tuple visibility checks. The need to synchronize this data structure is
>> + a major reason why we can't support writes in parallel mode: such writes
>> + might create new combo CIDs, and we have no way to let other workers
>> + (or the initiating backend) know about them.
>
> Absolutely *not* in the initial version, but we really need to find a
> better solution for this.

I have some ideas, but that's not for right now.

>> + - The currently active user ID and security context. Note that this is
>> + the fourth user ID we restore: the initial step of binding to the correct
>> + database also involves restoring the authenticated user ID. When GUC
>> + values are restored, this incidentally sets SessionUserId and OuterUserId
>> + to the correct values. This final step restores CurrentUserId.
>
> Ah. That's the answer for above. Could you just move it next to the
> other user bit?

Well, I think it's good to keep this in the same order it happens.
That's almost true, with the exception of the libraries, which were
out of order. (I've fixed that now.)
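
To make the ordering concrete, worker startup ends up doing roughly this, in
this order (just a sketch; the helper and variable names here are illustrative
rather than exactly what the patch uses):

    /* 1. Bind to the database; this restores the authenticated user ID. */
    BackgroundWorkerInitializeConnectionByOid(database_id, authenticated_user_id);

    /* 2. Restore GUCs; as a side effect this sets SessionUserId and OuterUserId. */
    RestoreGUCState(gucspace);

    /* 3. Finally, restore CurrentUserId and the active security context. */
    SetUserIdAndSecContext(current_user_id, sec_context);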

>> +We could copy the entire transaction state stack,
>> +but most of it would be useless: for example, you can't roll back to a
>> +savepoint from within a parallel worker, and there are no resources
>> +associated with the memory contexts or resource owners of intermediate
>> +subtransactions.
>
> I do wonder if we're not going to have to change this in the not too far
> away future. But then, this isn't externally visible at all, so
> whatever.

I definitely thought about copying the whole stack, but Heikki
suggested this approach, and I didn't see a reason to argue with it.
As you say, we can change it in the future if it proves problematic,
but so far I don't see a problem.

>> +At the end of a parallel operation, which can happen either because it
>> +completed successfully or because it was interrupted by an error, parallel
>> +workers associated with that operation exit. In the error case, transaction
>> +abort processing in the parallel leader kills off any remaining workers, and
>> +the parallel leader then waits for them to die.
>
> Very slightly awkward because first you talk about successful *or* error
> and then about abort processing.

I don't understand what's awkward about that. I make a general
statement about what happens at the end of a parallel operation, and
then the next few sentences follow up by explaining what happens in
the error case, and what happens in the success case.

>> + - Cleanup of pg_temp namespaces is not done. The initiating backend is
>> + responsible for this, too.
>
> How could a worker have its own pg_temp namespace?

It can't. My point here is that it won't clean up the master's
pg_temp namespace. I'll add an explicit prohibition against accessing
temporary relations and clarify these remarks.
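
Concretely, I'm thinking of something like this in the relation-access path
(sketch only; the exact location and wording are still to be decided):

    if (IsInParallelMode() && RelationUsesLocalBuffers(rel))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot access temporary tables during a parallel operation")));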

>> +Lock Management
>> +===============
>> +
>> +Certain heavyweight locks that the initiating backend holds at the beginning
>> +of parallelism are copied to each worker, which unconditionally acquires them.
>> +The parallel backends acquire - without waiting - each such lock that the
>> +leader holds, even if that lock is self-exclusive. This creates the unusual
>> +situation that a lock which could normally only be held by a single backend
>> +can be shared among several backends in a parallel group.
>> +
>> +Obviously, this presents significant hazards that would not be present in
>> +normal execution. If, for example, a backend were to initiate parallelism
>> +while ReindexIsProcessingIndex() were true for some index, the parallel
>> +backends launched at that time would neither share this state nor be excluded
>> +from accessing the index via the heavyweight lock mechanism. It is therefore
>> +imperative that backends only initiate parallelism from places where it will
>> +be safe for parallel workers to access the relations on which they hold locks.
>> +It is also important that they not afterwards do anything which causes access
>> +to those relations to become unsafe, or at least not until after parallelism
>> +has concluded. The fact that parallelism is strictly read-only means that the
>> +opportunities for such mishaps are few and far between; furthermore, most
>> +operations which present these hazards are DDL operations, which will be
>> +rejected by CheckTableNotInUse() if parallel mode is active.
>> +
>> +Only relation locks, locks on database or shared objects, and perhaps the lock
>> +on our transaction ID are copied to workers.
>
> "perhaps"?

I just meant "if we have one". Changed to "and any lock on our transaction ID".

>> Advisory locks are not copied,
>> +but the leader may hold them at the start of parallelism; they cannot
>> +subsequently be manipulated while parallel mode is active. It is not safe to
>> +attempt parallelism while holding a lock of any other type, such as a page,
>> +tuple, or relation extension lock, and such attempts will fail. Although
>> +there is currently no reason to do so, such locks could be taken and released
>> +during parallel mode; they merely cannot be held at the start of parallel
>> +mode, since we would then fail to provide necessary mutual exclusion.
>
> Is it really true that no such locks are acquired? What about e.g. hash
> indexes? They seem to be acquiring page locks while searching.

Ah, right. I have removed "although ... do so". I was thinking that
until we allow writes it wouldn't come up, but that's wrong.
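
To spell out the distinction, the pattern that remains legal is the strictly
scoped one, roughly what hash's heavyweight page locks do (sketch; the lock
level here is arbitrary):

    /* Fine in parallel mode: the lock is taken and released again. */
    LockPage(rel, blkno, ShareLock);
    /* ... do the page-level work ... */
    UnlockPage(rel, blkno, ShareLock);

What is not fine is still holding such a lock at the moment parallel mode
starts, since it would not be copied to the workers and the necessary mutual
exclusion would be lost.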

>> +Copying locks to workers is important for the avoidance of undetected
>> +deadlock between the initiating process and its parallel workers. If the
>> +initiating process holds a lock on an object at the start of parallelism,
>> +and the worker subsequently attempts to acquire a lock on that same object
>> +and blocks, this will result in an undetected deadlock, because the
>> +initiating process cannot finish the transaction (thus releasing the lock)
>> +until the worker terminates, and the worker cannot acquire the lock while
>> +the initiating process holds it. Locks which the processes involved acquire
>> +and then release during parallelism do not present this hazard; they simply
>> +force the processes involved to take turns accessing the protected resource.
>
> I don't think this is a strong guarantee. There very well can be lack of
> forward progress if they're waiting on each other in some way. Say the
> master backend holds the lock and waits on output from a worker. The
> worker then will endlessly wait for the lock to become free. A
> deadlock. Or, as another scenario, consider cooperating backends that
> both try to send tuples to each other but the queue is full. A deadlock.

The idea is that both the master and the workers are restricted to
locks which they take and release again; specifically, the goal is to
allow syscache lookups. Taking a lock and then waiting for the
worker is no good, which is why WaitForParallelWorkersToFinish() calls
CheckForRetainedParallelLocks(). That would catch your first
scenario.
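
Restated as code, the invariant being enforced is roughly this (sketch;
CheckForRetainedParallelLocks is the name in the current patch version, and
the real function obviously does more):

    void
    WaitForParallelWorkersToFinish(ParallelContext *pcxt)
    {
        /*
         * Refuse to wait for workers while holding any lock acquired after
         * entering parallel mode; a worker blocked on such a lock would wait
         * forever, and the deadlock detector would never see the cycle.
         */
        CheckForRetainedParallelLocks();

        /* ... then actually wait for each worker to finish ... */
    }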

Your second scenario seems to me to be a different kind of problem.
If that comes up, what you're going to want to do is rewrite the
workers to avoid the deadlock by using non-blocking messaging. (The
recent discussions of fixing the deadlock where a libpq client and a
postgres server are both blocked on writes while both output buffers
are full center on similar issues.)  Detecting the
deadlock and aborting is better than nothing, but not by much. In any
case, I don't think it's really this patch's job to prevent deadlocks
in code that doesn't exist today and in no way involves the lock
manager.
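
To make that concrete, such workers should be written around the non-blocking
shm_mq calls, so that neither side sleeps while it still owes the other data.
A rough sketch (the queue handles and process_request() are made up for
illustration; latch handling and error checks omitted):

    for (;;)
    {
        Size        nbytes;
        void       *data;
        shm_mq_result res;

        /* First, try to push out any result we still owe the peer. */
        if (pending != NULL &&
            shm_mq_send(outqh, pending_size, pending, true) == SHM_MQ_SUCCESS)
            pending = NULL;

        /* Only accept more work once the previous answer has been sent. */
        if (pending == NULL)
        {
            res = shm_mq_receive(inqh, &nbytes, &data, true);
            if (res == SHM_MQ_DETACHED)
                break;          /* peer has gone away; we're done */
            if (res == SHM_MQ_SUCCESS)
                pending = process_request(data, nbytes, &pending_size);
        }

        /* Otherwise, sleep on the process latch until the peer makes progress. */
    }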

> To me it seems the deadlock detector has to be enhanced to be able to
> see 'waiting for' edges. Independent on how we resolve our difference of
> opinion on the copying of locks.
>
> It seems to me that this isn't all that hard: Whenever we block waiting
> for another backend we enter ourselves on the wait queue to that
> backend's virtual transaction. When finished we take the blocking
> backend off. That, afaics, should do it. Alternatively we can just
> publish what backend we're waiting for in PGPROC and make deadlock also
> look at that; but, while slightly cleaner, that looks like being more
> invasive.

That's an interesting idea. It would be more flexible than what I've
got here right now, in that parallel backends could take and retain
locks on arbitrary objects, and we'd only error out if it actually
created a deadlock, instead of erroring out because of the potential
for a deadlock under some unlikely circumstances. But it can't be
done with existing lock manager APIs - right now there is no way to
put yourself on a wait queue for a virtual transaction except to try
to acquire a conflicting lock, and that's no good because then you
aren't actually trying to read data from it. You'd need some kind of
API that says "pretend I'm waiting for this lock, but don't really
wait for it", and you'd need to be darn sure that you removed yourself
from the wait queue again before doing any other heavyweight lock
manipulation. Do you have specific thoughts on how to implement this?
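
Just to make the shape of that concrete, I'm imagining something entirely
hypothetical along these lines (no such functions exist today):

    /* Advertise a waits-for edge to another virtual transaction without blocking. */
    extern void RegisterWaitForVirtualXact(VirtualTransactionId vxid);
    extern void UnregisterWaitForVirtualXact(VirtualTransactionId vxid);

    /* Every wait on a worker (or on the leader) would then be bracketed: */
    RegisterWaitForVirtualXact(worker_vxid);
    /* ... block on the worker's queue, latch, or exit ... */
    UnregisterWaitForVirtualXact(worker_vxid);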

>> +Copying locks to workers is also important for the avoidance of undetected
>> +deadlock involving both the parallel processe and other processes.
>
> "processe"

Fixed.

>> For
>> +example, suppose processes A1 and A2 are cooperating parallel processes and
>> +B is an unrelated process. Suppose A1 holds a lock L1 and waits for A2 to
>> +finish; B holds a lock L2 and blocks waiting for L1; and A2 blocks waiting
>> +for L2. Without lock copying, the waits-for graph is A2 -> B -> A1; there
>> +is no cycle. With lock copying, A2 will also hold the lock on L1 and the
>> +deadlock detector can find the cycle A2 -> B -> A2. As in the case of
>> +deadlocks within the parallel group, an undetected deadlock can occur if either A1
>> +or A2 acquires a lock after the start of parallelism and attempts to
>> +retain it beyond the end of parallelism. The prohibitions discussed above
>> +protect us against this case.
>
> I think we'd need to add more restrictions to actually make this
> guarantee anything. At the very least it would not only have to be
> prohibited to end with a lock held, but also to wait for a worker (or
> the leader) backend with a not initially granted lock.

I don't think so. The latter is safe. The other backend will finish
with the lock and then you get it afterwards.

> Am I missing something or does the copying currently break deadlock.c?
> Because afaics that'll compute lock conflicts in FindLockCycleRecurse()
> without being aware of the conflicting lock being granted to two
> backends. Won't this at least trigger spurious deadlocks? It might
> happen to be without consequence for some reason, but this would, at the
> very least, need some very careful review.

There may be a problem here, but I'm not seeing it. Initially, we're
blocking on a lock that we don't already hold, so it's not one of the
copied ones. If we follow one or more waits-for edges and arrive back
at a copied lock, that's a real deadlock and should be reported as
such.

Here is yet another version of this patch. In addition to the fixes
mentioned above, this version includes some minor rebasing around
recent commits, and also better handling of the case where we discover
that we cannot launch workers after all. This can happen because (1)
dynamic_shared_memory_type=none, (2) the system-wide limit on the
number of DSM segments has already been reached, or
(3) the user creates a parallel context with nworkers=0. In any of
those cases, the system will now create a backend-private memory
segment instead of a dynamic shared memory segment, and will skip
steps that don't need to be done in that case. This is obviously an
undesirable scenario. If we choose a parallel sequential scan, we
want it to launch workers and really run in parallel. Hopefully, in
case (1) or case (3), we will avoid choosing a parallel plan in the
first place, but case (2) is pretty hard to avoid completely, as we
have no idea what other processes may or may not be doing with dynamic
shared memory segments ... and, in any case, degrading to non-parallel
execution beats failing outright.
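
Schematically, parallel context creation now does something like this
(simplified sketch; the field names and the not-quite-real
dsm_create_if_possible() are illustrative only):

    pcxt->seg = NULL;
    if (dynamic_shared_memory_type != DSM_IMPL_NONE && pcxt->nworkers > 0)
        pcxt->seg = dsm_create_if_possible(segsize);    /* NULL if segments exhausted */

    if (pcxt->seg == NULL)
    {
        /* Backend-private fallback: no workers, but the operation still runs. */
        pcxt->private_memory = MemoryContextAlloc(TopTransactionContext, segsize);
        pcxt->nworkers = 0;
    }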

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
parallel-mode-v9.patch binary/octet-stream 145.3 KB
