Re: parallel mode and parallel contexts

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-10 16:49:58
Message-ID: CA+TgmoY0Nj_eNx3tGQKsi3M8wf7hq0p6jXSUpsdPLPRn5O1vxQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Observations:
> * Some trailing whitespace in the readme. Very nice otherwise.

Fixed. Thanks.

> * Don't like CreateParallelContextForExtension much as a function name -
> I don't think we normally equate the fact that code is located in a
> loadable library with the extension mechanism.

Suggestions for a better name? CreateParallelContextForLoadableFunction?

> * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
> that about?

Gee, maybe I should have added a comment so I'd remember. If the
buffer size isn't MAXALIGN'd, that would be really bad, because
shm_mq_create() assumes that it's being given an aligned address.
Maybe I should add an Assert() there. If it is MAXALIGN'd but not
BUFFERALIGN'd, we might waste a few bytes of space, since
shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I
don't think anything will actually break. Not sure if that's worth an
assert or not.
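
To make that concrete, the check is along these lines (paraphrased
sketch, not a verbatim excerpt from the patch):

    StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ==
                     PARALLEL_ERROR_QUEUE_SIZE,
                     "parallel error queue size not buffer-aligned");

The Assert I'm contemplating would be something along the lines of
Assert(address == (void *) MAXALIGN(address)) at the top of
shm_mq_create().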

> * the plain shm calculations intentionally use mul_size/add_size to deal
> with overflows. On 32bit it doesn't seem impossible, but unlikely to
> overflow size_t.

Yes, I do that here too, though as with the plain shm calculations,
only in the estimate functions. The functions that actually serialize
stuff don't have to worry about overflow because it's already been
checked.
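
For illustration, an estimate function in that style looks roughly
like the following sketch (FooShared/FooSlot and the function name are
made up; add_size()/mul_size() are the existing helpers that ereport()
on Size overflow):

    Size
    EstimateFooSpace(int nworkers)
    {
        Size        size;

        /* fixed-size header, plus one slot per worker */
        size = offsetof(FooShared, slots);
        size = add_size(size, mul_size(sizeof(FooSlot), nworkers));
        return size;
    }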

> * I'd s/very // in "We might be running in a very short-lived memory
> context.". Or replace it with "too".

Removed "very".

> * In +LaunchParallelWorkers, does it make sense trying to start workers
> if one failed? ISTM that's likely to not be helpful. I.e. it should
> just break; after the first failure.

It can't just break, because clearing pcxt->worker[i].error_mqh is an
essential step. I could add a flag variable that tracks whether any
registrations have failed and change the if statement to if
(!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle)), if you want. I thought about doing that
but decided it was very unlikely to affect the real-world performance
of anything, so it seemed easier just to keep the code simple. But I'll change
it if you want.
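
For clarity, the variant I'm describing would look roughly like this
(sketch only, simplified from the actual loop in the patch):

    bool        any_registrations_failed = false;

    for (i = 0; i < pcxt->nworkers; ++i)
    {
        /* ... set up the BackgroundWorker struct as before ... */
        if (!any_registrations_failed &&
            RegisterDynamicBackgroundWorker(&worker,
                                            &pcxt->worker[i].bgwhandle))
        {
            /* registered OK; keep error_mqh so we can read its messages */
        }
        else
        {
            /*
             * Registration failed, or was skipped because an earlier one
             * failed; clearing error_mqh is still the essential step.
             */
            any_registrations_failed = true;
            pcxt->worker[i].error_mqh = NULL;
        }
    }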

> * +WaitForParallelWorkersToFinish says that it waits for workers to exit
> cleanly. To me that's ambiguous. How about "fully"?

I've removed the word "cleanly" and added a comment to more fully
explain the danger:

+ * Even if the parallel operation seems to have completed successfully, it's
+ * important to call this function afterwards. We must not miss any errors
+ * the workers may have thrown during the parallel operation, or any that they
+ * may yet throw while shutting down.

> * ParallelMain restores libraries before GUC state. Given that
> libraries can, and actually somewhat frequently do, inspect GUCs on
> load, it seems better to do it the other way round? You argue "We want
> to do this before restoring GUCs, because the libraries might define
> custom variables.", but I don't buy that. It's completely normal for
> namespaced GUCs to be present before a library is loaded? Especially
> as you then go and process_session_preload_libraries() after setting
> the GUCs.

This is a good question, but the answer is not entirely clear to me.
I'm thinking I should probably just remove
process_session_preload_libraries() altogether at this point. That
was added at a time when RestoreLibraryState() didn't exist yet, and I
think RestoreLibraryState() is a far superior way of handling this,
because surely the list of libraries that the parallel leader
*actually had loaded* is more definitive than any GUC.

Now, that doesn't answer the question about whether we should load
libraries first or GUCs. I agree that the comment's reasoning is
bogus, but I'm not sure I understand why you think the other order is
better. It *is* normal for namespaced GUCs to be present before a
library is loaded, but it's equally normal (and, I think, often more
desirable in practice) to load the library first and then set the
GUCs. Generally, I think that libraries ought to be loaded as early
as possible, because they may install hooks that change the way other
stuff works, and should therefore be loaded before that other stuff
happens.
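
In other words, I'd keep the worker's startup order as it is now,
roughly like this (sketch; the shm_toc key names are the patch's
internal constants):

    /* Load whatever libraries the leader had loaded... */
    libraryspace = shm_toc_lookup(toc, PARALLEL_KEY_LIBRARY);
    RestoreLibraryState(libraryspace);

    /* ...and only then apply the serialized GUC state on top of them. */
    gucspace = shm_toc_lookup(toc, PARALLEL_KEY_GUC);
    RestoreGUCState(gucspace);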

> * Should ParallelMain maybe enter the parallel context before loading
> user defined libraries? It's far from absurd to execute code touching
> the database on library initialization...

It's hard to judge without specific examples. What kinds of things do
they do? Are they counting on a transaction being active? I would
have thought that was a no-no, since there are many instances in which
it won't be true. Also, you might have just gotten loaded because a
function stored in your library was called, so you could be in a
transaction that's busy doing something else, or deep in a
subtransaction stack, etc. It seems unsafe to do very much more than
a few syscache lookups here, even if there does happen to be a
transaction active.

> * rename ParallelMain to ParallelWorkerMain?

Sounds good. Done.

> * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
> to be the minimum of the top transaction id and the
> snapshots. Otherwise if the initiating process dies badly
> (e.g. because postmaster died) the workers will continue to work,
> while other processes may remove things.

RestoreTransactionSnapshot() works the same way as the existing
import/export snapshot stuff, so that ought to be no less safe than
what we're doing already. Any other snapshots that we're restoring
had better not have an xmin lower than that one; if they do, the
master messed up. Possibly it would be a good idea to have additional
safeguards there; not sure exactly what. Perhaps RestoreSnapshot()
could assert that the xmin of the restored snapshot
follows-or-is-equal-to PGXACT->xmin? That would be safe for the first
snapshot we restore because our xmin will be InvalidTransactionId, and
after that it should check the condition you're worried about?
Thoughts?
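
A minimal sketch of what I have in mind (for the first snapshot our
xmin is still InvalidTransactionId, so the first branch lets it
through; after that the second branch checks the case you're worried
about):

    Assert(!TransactionIdIsValid(MyPgXact->xmin) ||
           TransactionIdFollowsOrEquals(snapshot->xmin, MyPgXact->xmin));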

> Also, you don't seem to
> prohibit popping the active snapshot (should that be prohibited
> maybe?) which might bump the initiator's xmin horizon.

I think as long as our transaction snapshot is installed correctly our
xmin horizon can't advance; am I missing something?

It's generally OK to pop the active snapshot, as long as you only pop
what you pushed. In a worker, you can push additional snapshots on
the active snapshot stack and then pop them, but you can't pop the one
ParallelWorkerMain installed for you. You'll probably notice if you
do, because then the snapshot stack will be empty when you get back to
ParallelWorkerMain() and you'll fail an assertion. Similarly, the
master shouldn't pop the snapshot that was active at the start of
parallelism until parallelism is done, but again if you did you'd
probably fail an assertion later on. Generally, we haven't had much
of a problem with PushActiveSnapshot() and PopActiveSnapshot() calls
being unbalanced, so I don't really think this is an area that needs
especially strong cross-checks. At least not unless we get some
evidence that this is a more-common mistake in code that touches
parallelism than it is in general.
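
In other words, this sort of pattern is fine inside a worker (sketch):

    /* OK: push our own snapshot and pop that same snapshot. */
    PushActiveSnapshot(GetTransactionSnapshot());
    /* ... do some work under the pushed snapshot ... */
    PopActiveSnapshot();

What you can't do is pop one more time and remove the snapshot that
ParallelWorkerMain() installed; an assertion should catch that later.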

> * Comment about 'signal_handler' in +HandleParallelMessageInterrupt
> is outdated.

Removed.

> * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
> exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
> not sure it's wise to use this faux FE/BE protocol here...

Well, I'm not sure about that either and never have been, but I was
even less sure inventing a new one was any better. We might need a
few new protocol messages (or to reuse a few existing ones for other
things) but being able to reuse the existing format for ErrorResponse,
NoticeResponse, etc. seems like a pretty solid win. Those are
reasonably complex message formats and reimplementing them for no
reason seems like a bad idea.

Terminate is 'X', not 'T', and it's a frontend-only message. The
worker is speaking the backend half of the protocol. We could use it
anyway; that wouldn't be silly. I picked ReadyForQuery because that's
what the backend sends when it is completely done processing
everything that the user most recently requested, which seems
defensible here.

> * HandlingParallelMessages looks dead.

Good catch. Removed.

> * ParallelErrorContext has the wrong comment.

Doh. Fixed.

> * ParallelErrorContext() provides the worker's pid in the context
> message. I guess that's so it's easier to interpret when sent to the
> initiator? It'll look odd when logged by the failing process.

Yes, that's why. Regarding logging, true. I guess the master could
add the context instead, although making sure the PID is available
looks pretty annoying. At the time we establish the queue, the PID
isn't known yet, and by the time we read the error from it, the worker
might be gone, such that we can't read its PID. To fix, we'd have to
invent a new protocol message that means "here's my PID". Another
alternative is to just say that the error came from a parallel worker
(e.g. "while executing in parallel worker") and not mention the PID,
but that seems like it'd be losing possibly-useful information.

> * We now have pretty much the same handle_sigterm in a bunch of
> places. Can't we get rid of those? You actually probably can just use
> die().

Good idea. Done.

> * The comments in xact.c above XactTopTransactionId pretty much assume
> that the reader knows that that is about parallel stuff.

What would you suggest? The comment begins "Only a single
TransactionStateData is placed on the parallel worker's state stack",
which seems like a pretty clear way of giving the user a hint that we
are talking about parallel stuff. An older version of the patch was
much less clear, but I thought I'd remedied that.

> * I'm a bit confused by the fact that Start/CommitTransactionCommand()
> emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
> ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
> pretty much impossible to hit a ReleaseSavepoint() with a parallel
> transaction in progress? We'd have had to end the previous transaction
> command while parallel stuff was in progress - right?
> (Internal/ReleaseCurrentSubTransaction is different, that's used in code)

It's pretty easy to hit ReleaseSavepoint() while a parallel operation
is in progress: it's pretty much directly SQL-callable, so a PL
function run in a parallel worker could easily hit it, as could
anything that uses SPI. There are similar checks in
e.g. EndTransaction(), but you can't invoke
CommitTransactionCommand() directly from SQL, so the generic elog()
there seems sufficient.
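
The check itself is just the usual pattern (sketch; the message
wording may differ from what's in the patch):

    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot release savepoints during a parallel operation")));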

> * Why are you deviating from the sorting order used in other places for
> ParallelCurrentXids? That seems really weird, especially as we use
> something else a couple lines down. Especially as you actually seem to
> send stuff in xidComparator order?

The transaction IDs have to be sorted into some order so that they can
be binary-searched, and this seemed simplest. xidComparator sorts in
numerical order, not transaction-ID order, so that's how we send them.
That turns out to be convenient anyway, because binary-search on a
sorted array of integers is really simple. If we sent them sorted in
transaction-ID order we'd have to make sure that the reference
transaction ID was the same for both backends, which might not be
hard, but I don't see how it would be better than this.
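
That is, the worker-side lookup can be a stock binary search over the
array in plain numerical order, roughly like this sketch (the helper
name is hypothetical; ParallelCurrentXids/nParallelCurrentXids are the
patch's variables):

    static bool
    XidIsInParallelCurrentXids(TransactionId xid)
    {
        int         low = 0;
        int         high = nParallelCurrentXids - 1;

        while (low <= high)
        {
            int             middle = low + (high - low) / 2;
            TransactionId   probe = ParallelCurrentXids[middle];

            if (probe == xid)
                return true;
            else if (probe < xid)
                low = middle + 1;
            else
                high = middle - 1;
        }
        return false;
    }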

> * I don't think skipping AtEOXact_Namespace() entirely if parallel is a
> good idea. That function already does some other stuff than cleaning
> up temp tables. I think you should pass in parallel and do the skip in
> there.

That's a very good point. Fixed.

> * Start/EndParallelWorkerTransaction assert the current state, whereas
> the rest of the file FATALs in that case. I think it'd actually be
> good to be conservative and do the same in this case.

Well, actually, StartTransaction() does this:

    if (s->state != TRANS_DEFAULT)
        elog(WARNING, "StartTransaction while in %s state",
             TransStateAsString(s->state));

I could copy and paste that code into StartParallelWorkerTransaction()
and change WARNING to FATAL, but the first thing
StartParallelWorkerTransaction() does is call StartTransaction(). It
seems pretty stupid to have two identical tests that differ only in
their log level. The same considerations apply to
EndParallelWorkerTransaction() and CommitTransaction().

A worthwhile question is why somebody thought that it was a good idea
for the log level there to be WARNING rather than FATAL. But I don't
think it's this patch's job to second-guess that decision.

> * You're manually passing function names to
> PreventCommandIfParallelMode() in a fair number of cases. I'd either
> try and keep the names consistent with what the functions are actually
> called at the sql level (adding the types in the parens) or just use
> PG_FUNCNAME_MACRO to keep them correct.

I think putting the type names in is too chatty; I'm not aware that we
use that style in other error messages. We don't want to lead people to
believe that only the form with the particular argument types they
used is not OK.

PG_FUNCNAME_MACRO will give us the C name, not the SQL name.

> * Wait. You now copy all held relation held "as is" to the standby? I
> quite doubt that's a good idea, and it's not my reading of the
> conclusions made in the group locking thread. At the very least this
> part needs to be extensively documented. And while
> LockAcquireExtended() refers to
> src/backend/access/transam/README.parallel for details I don't see
> anything pertinent in there. And the function header sounds like the
> only difference is the HS logging - not mentioning that it essentially
> disables lock queuing entirely.
>
> This seems unsafe (e.g. consider if the initiating backend died and
> somebody else acquired the lock, possible e.g. if postmaster died) and
> not even remotely enough discussed. I think this should be removed
> from the patch for now.

If it's broken, we need to identify what's wrong and fix it, not just
rip it out. It's possible that something is broken with that code,
but it's dead certain that something is broken without it:

rhaas=# select parallel_count('pgbench_accounts', 1);
NOTICE: PID 57956 counted 2434815 tuples
NOTICE: PID 57957 counted 1565185 tuples
CONTEXT: parallel worker, pid 57957
 parallel_count
----------------
        4000000
(1 row)

rhaas=# begin;
BEGIN
rhaas=# lock pgbench_accounts;
LOCK TABLE
rhaas=# select parallel_count('pgbench_accounts', 1);
NOTICE: PID 57956 counted 4000000 tuples

...and then it hangs forever.

On the specific issues:

1. I agree that it's very dangerous for the parallel backend to
acquire the lock this way if the master no longer holds it.
Originally, I was imagining that there would be no interlock between
the master shutting down and the worker starting up, but you and
others convinced me that was a bad idea. So now transaction commit or
abort waits for all workers to be gone, which I think reduces the
scope of possible problems here pretty significantly. However, it's
quite possible that it isn't airtight. One thing we could maybe do to
make it safer is pass a pointer to the initiator's PGPROC. If we get
the lock via the fast-path we are safe anyway, but if we have to
acquire the partition lock, then we can cross-check that the
initiator's lock is still there. I think that would button this up
pretty tight.

2. My reading of the group locking discussions was that everybody
agreed that the originally-proposed group locking approach, which
involved considering locks from the same locking group as mutually
non-conflicting, was not OK. Several specific examples were offered -
e.g. it's clearly not OK for two backends to extend a relation at the
same time just because they are in the same locking group. So I abandoned that
approach. When I instead proposed the approach of copying only the
locks that the master *already* had at the beginning of parallelism
and considering *only those* as mutually conflicting, I believe I got
several comments to the effect that this was "less scary".
Considering the topic area, I'm not sure I'm going to do any better
than that.

3. I welcome proposals for other ways of handling this problem, even
if they restrict the functionality that can be offered. For example,
a proposal that would make parallel_count revert to single-threaded
mode and complete without an indefinite wait would be acceptable to
me, provided that it offers some advantage in safety and security over
what I've already got. A proposal to make parallel_count error out
in the above case would not be acceptable to me; the planner must not
generate parallel plans that will sometimes fail unexpectedly at
execution time. I generally believe that we will be much happier if
application programmers need not worry about the failure of parallel
workers to obtain locks already held by the master; some such failure
modes may be very complex and hard to predict. The fact that the
current approach handles the problem entirely within the lock manager,
combined with the fact that it is extremely simple, is therefore very
appealing to me. Nonetheless, a test case that demonstrates this
approach falling down badly would force a rethink; do you have one?
Or an idea about what it might look like?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
