Re: parallel mode and parallel contexts

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: parallel mode and parallel contexts
Date: 2015-02-11 01:45:28
Message-ID: 20150211014528.GQ21017@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-02-10 11:49:58 -0500, Robert Haas wrote:
> On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > * Don't like CreateParallelContextForExtension much as a function name -
> > I don't think we normally equate the fact that code is located in a
> > loadable library with the extension mechanism.
>
> Suggestions for a better name? CreateParallelContextForLoadableFunction?

*ExternalFunction maybe? That's what dfmgr.c calls it.

> > * the plain shm calculations intentionally use mul_size/add_size to deal
> > with overflows. On 32bit it doesn't seem impossible, though unlikely, to
> > overflow size_t.
>
> Yes, I do that here too, though as with the plain shm calculations,
> only in the estimate functions. The functions that actually serialize
> stuff don't have to worry about overflow because it's already been
> checked.

I thought I'd seen some estimation functions that didn't use it. But
apparently I was wrong.

> > * In +LaunchParallelWorkers, does it make sense trying to start workers
> > if one failed? ISTM that's likely to not be helpful. I.e. it should
> > just break; after the first failure.
>
> It can't just break, because clearing pcxt->worker[i].error_mqh is an
> essential step. I could add a flag variable that tracks whether any
> registrations have failed and change the if statement to if
> (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
> &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that
> but decided it was very unlikely to affect the real-world performance
> of anything, so easier just to keep the code simple. But I'll change
> it if you want.

I think it'd be better.
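
For reference, a quick sketch of the loop body in LaunchParallelWorkers
with such a flag (untested; `worker` and `pcxt` as in the patch, the flag
name is mine):

    bool        any_registrations_failed = false;

    for (i = 0; i < pcxt->nworkers; ++i)
    {
        /*
         * If one registration failed, don't try to start the remaining
         * workers, but still reset bgwhandle/error_mqh so that later
         * cleanup doesn't wait for workers that were never launched.
         */
        if (!any_registrations_failed &&
            RegisterDynamicBackgroundWorker(&worker,
                                            &pcxt->worker[i].bgwhandle))
            shm_mq_set_handle(pcxt->worker[i].error_mqh,
                              pcxt->worker[i].bgwhandle);
        else
        {
            any_registrations_failed = true;
            pcxt->worker[i].bgwhandle = NULL;
            pcxt->worker[i].error_mqh = NULL;
        }
    }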

> > * ParallelMain restores libraries before GUC state. Given that
> > libraries can, and actually somewhat frequently do, inspect GUCs on
> > load, it seems better to do it the other way round? You argue "We want
> > to do this before restoring GUCs, because the libraries might define
> > custom variables.", but I don't buy that. It's completely normal for
> > namespaced GUCs to be present before a library is loaded? Especially
> > as you then go and process_session_preload_libraries() after setting
> > the GUCs.
>
> This is a good question, but the answer is not entirely clear to me.
> I'm thinking I should probably just remove
> process_session_preload_libraries() altogether at this point. That
> was added at a time when RestoreLibraryState() didn't exist yet, and I
> think RestoreLibraryState() is a far superior way of handling this,
> because surely the list of libraries that the parallel leader
> *actually had loaded* is more definitive than any GUC.

That sounds like a good idea to me.

> Now, that doesn't answer the question about whether we should load
> libraries first or GUCs. I agree that the comment's reasoning is
> bogus, but I'm not sure I understand why you think the other order is
> better. It *is* normal for namespaced GUCs to be present before a
> library is loaded, but it's equally normal (and, I think, often more
> desirable in practice) to load the library first and then set the
> GUCs.

Well, it's pretty much never the case that the library is loaded before
postgresql.conf GUCs, right? A changed postgresql.conf is the only
exception I can see. Neither is it the normal case for
session|local_preload_libraries. Not even when GUCs are loaded via
pg_db_role_setting or the startup packet...

> Generally, I think that libraries ought to be loaded as early
> as possible, because they may install hooks that change the way other
> stuff works, and should therefore be loaded before that other stuff
> happens.

While that may be desirable, I don't really see a reason for this to be
treated differently for worker processes than for the majority of other
cases.

Anyway, I think this is a relatively minor issue.

> > * Should ParallelMain maybe enter the parallel context before loading
> > user defined libraries? It's far from absurd to execute code touching
> > the database on library initialization...
>
> It's hard to judge without specific examples. What kinds of things do
> they do?

I've seen code filling lookup caches and creating system objects
(including tables and extensions).

> Are they counting on a transaction being active?

> I would have thought that was a no-no, since there are many instances
> in which it won't be true. Also, you might have just gotten loaded
> because a function stored in your library was called, so you could be
> in a transaction that's busy doing something else, or deep in a
> subtransaction stack, etc. It seems unsafe to do very much more than
> a few syscache lookups here, even if there does happen to be a
> transaction active.

The only reason I'd like it to be active is that that'd *prohibit*
doing the crazier stuff. There seems little reason not to do it, given
the additional protection against crazy things that it'd give us?

> > * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
> > to be the minimum of the top transaction id and the
> > snapshots. Otherwise if the initiating process dies badly
> > (e.g. because postmaster died) the workers will continue to work,
> > while other processes may remove things.
>
> RestoreTransactionSnapshot() works the same way as the existing
> import/export snapshot stuff, so that ought to be no less safe than
> what we're doing already.

Well, ExportSnapshot()/ImportSnapshot() has quite a few more restrictions than
what you're doing... Most importantly it cannot import in transactions
unless using read committed isolation, forces xid assignment during
export, forces the old snapshot to stick around for the whole
transaction and only works on a primary. I'm not actually sure it makes
a relevant difference, but it's certainly worth calling attention to.

The fudging I was wondering about certainly is unnecessary though -
GetSnapshotData() has already performed it...

I'm not particularly awake right now, and I think this needs a closer
look, either by someone more awake or by me later... I'm not fully
convinced this is safe.

> > Also, you don't seem to
> > prohibit popping the active snapshot (should that be prohibited
> > maybe?) which might bump the initiator's xmin horizon.
>
> I think as long as our transaction snapshot is installed correctly our
> xmin horizon can't advance; am I missing something?

Maybe I'm missing something, but why would that be the case in a read
committed session? ImportSnapshot() only ever calls
SetTransactionSnapshot in such a case (why does it contain code to cope
without it?), but your patch doesn't seem to guarantee that.

But as you don't actually transport XactIsoLevel anywhere (an omission,
probably?), that seems to mean that SetTransactionSnapshot() won't do
much, even if the source transaction is repeatable read.

Now the thing that actually does give some guarantees is that you
immediately afterwards restore the active snapshot and do a
PushActiveSnapshot(), which'll prevent MyPgXact->xmin from changing
because SnapshotResetXmin() refuses to do anything if there's an
ActiveSnapshot. Seems a bit complicated and fragile though.
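
Spelled out, the sequence I'm reading from the patch is roughly this
(function names from memory, so treat it as a sketch rather than the
actual patch code):

    /* worker startup */
    RestoreTransactionSnapshot(RestoreSnapshot(tsnapspace), master_pgproc);
    PushActiveSnapshot(RestoreSnapshot(asnapspace));

    /*
     * As long as something is on the active snapshot stack,
     * SnapshotResetXmin() returns without clearing MyPgXact->xmin, so the
     * xmin horizon can't advance - but only as a side effect of the above.
     */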

> > * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
> > exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
> > not sure it's wise to use this faux FE/BE protocol here...
>
> Well, I'm not sure about that either and never have been, but I was
> even less sure inventing a new one was any better. We might need a
> few new protocol messages (or to reuse a few existing ones for other
> things) but being able to reuse the existing format for ErrorResponse,
> NoticeResponse, etc. seems like a pretty solid win. Those are
> reasonably complex message formats and reimplementing them for no
> reason seems like a bad idea.
>
> Terminate is 'X', not 'T'

Oops, yes.

> and it's a frontend-only message. The worker is speaking the backend
> half of the protocol. We could use it anyway; that wouldn't be silly.

Even if it's a frontend-only one it doesn't seem like a bad idea. I've
occasionally wished the backend would send explicit termination messages
in a bunch of scenarios. I'm a bit tired of seeing:

FATAL: 57P01: terminating connection due to administrator command
LOCATION: ProcessInterrupts, postgres.c:2888
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

when the session was terminated while a query is active. An explicit
termination message seems like a nicer solution than interpreting the
FATAL severity.

> I picked ReadyForQuery because that's
> what the backend sends when it is completely done processing
> everything that the user most recently requested, which seems
> defensible here.

I'm pretty sure that we're going to reuse workers within a parallel
query at some point, and ReadyForQuery seems like a nicer code for
saying 'finished with my work, give me the next thing'.

> > * ParallelErrorContext() provides the worker's pid in the context
> > message. I guess that's so it's easier to interpret when sent to the
> > initiator? It'll look odd when logged by the failing process.
>
> Yes, that's why. Regarding logging, true. I guess the master could
> add the context instead, although making sure the PID is available
> looks pretty annoying. At the time we establish the queue, the PID
> isn't known yet, and by the time we read the error from it, the worker
> might be gone, such that we can't read its PID. To fix, we'd have to
> invent a new protocol message that means "here's my PID".

Hm. Or we could attach the pid to the error message in that case - just
like there already is schema_name etc. Or, to stay more in the FE/BE
vibe, we could just send a 'K' message at startup, which in normal
connections sends MyProcPid and MyCancelKey.
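
In normal connections that's roughly (from memory of postgres.c, so
consider it a sketch):

    StringInfoData buf;

    pq_beginmessage(&buf, 'K');
    pq_sendint(&buf, (int32) MyProcPid, 4);
    pq_sendint(&buf, (int32) MyCancelKey, 4);
    pq_endmessage(&buf);

A worker would presumably send just its pid (and maybe a dummy cancel
key) over its error queue instead of the client socket.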

> > * The comments in xact.c above XactTopTransactionId pretty much assume
> > that the reader knows that that is about parallel stuff.
>
> What would you suggest? The comment begins "Only a single
> TransactionStateData is placed on the parallel worker's state stack",
> which seems like a pretty clear way of giving the user a hint that we
> are talking about parallel stuff.

Replacing "Only" by "When running as parallel worker we only place a
..." would already help. To me the comment currently readss like it
desperately wishes to be located in the function initiating parallelism
rather than global file scope. Maybe it's lonely or such.
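
I.e. the comment could start along the lines of:

    /*
     * When running as a parallel worker, we place only a single
     * TransactionStateData on the parallel worker's state stack, ...
     */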

> > * I'm a bit confused by the fact that Start/CommitTransactionCommand()
> > emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
> > ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
> > pretty much impossible to hit a ReleaseSavepoint() with a parallel
> > transaction in progress? We'd have had to end the previous transaction
> > command while parallel stuff was in progress - right?
> > (Internal/ReleaseCurrentSubTransaction is different, that's used in code)
>
> It's pretty simple to hit ReleaseSavepoint() while a transaction is in
> progress. It's pretty much directly SQL-callable, so a PL function
> run in a parallel worker could easily hit it, or anything that uses
> SPI.

SPI doesn't really work across subtransaction boundaries, so it really
should never be called that way. Which is why it (and thus the PLs)
return SPI_ERROR_TRANSACTION in case you try to execute it. If it's
possible to hit it, we have a problem - I think it'd pretty much lead to
rampant memory clobbering and such. Now, I'm not against providing an
error check, I was just surprised by the differing verbosity of the two
locations.

> > * Why are you deviating from the sorting order used in other places for
> > ParallelCurrentXids? That seems really weird, especially as we use
> > something else a couple lines down. Especially as you actually seem to
> > send stuff in xidComparator order?
>
> The transaction IDs have to be sorted into some order so that they can
> be binary-searched, and this seemed simplest. xidComparator sorts in
> numerical order, not transaction-ID order, so that's how we send them.

Forget that bit. Was tired. I'm not sure what I thought.

> > * Start/EndParallelWorkerTransaction assert the current state, whereas
> > the rest of the file FATALs in that case. I think it'd actually be
> > good to be conservative and do the same in this case.
>
> Well, actually, StartTransaction() does this:
>
> if (s->state != TRANS_DEFAULT)
> elog(WARNING, "StartTransaction while in %s state",
> TransStateAsString(s->state));

But StartTransactionCommand does
elog(ERROR, "StartTransactionCommand: unexpected state %s",
BlockStateAsString(s->blockState));
and CommitTransactionCommand does
elog(FATAL, "CommitTransactionCommand: unexpected state %s",
BlockStateAsString(s->blockState));
and some more. These only deal with blockState though...

> A worthwhile question is why somebody thought that it was a good idea
> for the log level there to be WARNING rather than FATAL.

Yea, it's really rather odd. Every time I've seen those WARNINGs, e.g. in
the recent "hung backends stuck in spinlock" thread, things got even more
pear-shaped shortly afterwards.

> But I don't think it's this patch's job to second-guess that decision.

Fair enough.

> > * You're manually passing function names to
> > PreventCommandIfParallelMode() in a fair number of cases. I'd either
> > try and keep the names consistent with what the functions are actually
> > called at the sql level (adding the types in the parens) or just use
> > PG_FUNCNAME_MACRO to keep them correct.
>
> I think putting the type names in is too chatty; I'm not aware we use
> that style in other error messages. We don't want to lead people to
> believe that only the form with the particular argument types they
> used is not OK.

> PG_FUNCNAME_MACRO will give us the C name, not the SQL name.

Your manual ones don't either; that's what made me
complain. E.g. pg_advisory_lock_int8 isn't called that at the SQL
level. Instead they use the C name + parens.
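
To illustrate the three options (the strings here are made up, I don't
remember the exact ones in the patch):

    PreventCommandIfParallelMode("pg_advisory_lock_int8()");  /* C name + parens */
    PreventCommandIfParallelMode("pg_advisory_lock()");       /* SQL-level name */
    PreventCommandIfParallelMode(PG_FUNCNAME_MACRO);          /* always-correct C name */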

> > * Wait. You now copy all held relation locks "as is" to the worker? I
> > quite doubt that's a good idea, and it's not my reading of the
> > conclusions made in the group locking thread. At the very least this
> > part needs to be extensively documented. And while
> > LockAcquireExtended() refers to
> > src/backend/access/transam/README.parallel for details I don't see
> > anything pertinent in there. And the function header sounds like the
> > only difference is the HS logging - not mentioning that it essentially
> > disables lock queuing entirely.
> >
> > This seems unsafe (e.g. consider if the initiating backend died and
> > somebody else acquired the lock, possible e.g. if postmaster died) and
> > not even remotely enough discussed. I think this should be removed
> > from the patch for now.
>
> If it's broken, we need to identify what's wrong and fix it, not just
> rip it out. It's possible that something is broken with that code,
> but it's dead certain that something is broken without it:
>
> rhaas=# select parallel_count('pgbench_accounts', 1);
> NOTICE: PID 57956 counted 2434815 tuples
> NOTICE: PID 57957 counted 1565185 tuples
> CONTEXT: parallel worker, pid 57957
> parallel_count
> ----------------
> 4000000
> (1 row)
>
> rhaas=# begin;
> BEGIN
> rhaas=# lock pgbench_accounts;
> LOCK TABLE
> rhaas=# select parallel_count('pgbench_accounts', 1);
> NOTICE: PID 57956 counted 4000000 tuples
>
> ...and then it hangs forever.

Which is *not* a good example for the problem. Your primary reasoning
for needing something more than sharing the locks that we know the
individual query is going to need (which we acquire in the parse
analyze/planner/executor) is that we can't predict what some random code
does. Now, I still don't think that argument should hold too much sway
because we'll only be able to run carefully controlled code
*anyway*. But *if* we take it as reasoning for doing more than granting
the locks we need for the individual query, we *still* need to cope with
exactly the variants of the above invisible deadlock. You can still
call a function acquiring some form of lock on either side.

If you'd, as I've argued for before, provide an API that granted workers
precisely the locks needed for the execution of a certain type of
parallel action, I'd be happy. I.e. regular parallel queries would
transport an extended version of what InitPlan() does with relation
locks.

> On the specific issues:
>
> 1. I agree that it's very dangerous for the parallel backend to
> acquire the lock this way if the master no longer holds it.
> Originally, I was imagining that there would be no interlock between
> the master shutting down and the worker starting up, but you and
> others convinced me that was a bad idea. So now transaction commit or
> abort waits for all workers to be gone, which I think reduces the
> scope of possible problems here pretty significantly. However, it's
> quite possible that it isn't airtight. One thing we could maybe do to
> make it safer is pass a pointer to the initiator's PGPROC. If we get
> the lock via the fast-path we are safe anyway, but if we have to
> acquire the partition lock, then we can cross-check that the
> initiator's lock is still there. I think that would button this up
> pretty tight.

That'd certainly make me feel less bad about it.

> 2. My reading of the group locking discussions was that everybody
> agreed that the originally-proposed group locking approach, which
> involved considering locks from the same locking group as mutually
> non-conflicting, was not OK. Several specific examples were offered -
> e.g. it's clearly not OK for two backends to extend a relation at the
> same time just because they're in the same locking group. So I abandoned that
> approach. When I instead proposed the approach of copying only the
> locks that the master *already* had at the beginning of parallelism
> and considering *only those* as mutually conflicting, I believe I got
> several comments to the effect that this was "less scary".
> Considering the topic area, I'm not sure I'm going to do any better
> than that.

It surely is less scary, but still darn scary. It's just friggin hard to
have a mental model where (self) exclusive locks suddenly aren't that
anymore. It'll get more and more dangerous the more we start to relax
restrictions around parallelism - and that *will* come.

This concern does not only apply to user-level commands, but also to
our own C code. We will find more and more reasons to parallelize
commands, and then we will have problems with suddenly not having a way
to prevent parallelism during certain actions. Say we parallelize VACUUM
(grand idea for the index scans!) and someone wants to also improve the
truncation. Suddenly the AEL during truncation doesn't give us
guarantees anymore. There'll be many more of these, and we can't really
reason about them because we're used to working locks.

> 3. I welcome proposals for other ways of handling this problem, even
> if they restrict the functionality that can be offered. For example,
> a proposal that would make parallel_count revert to single-threaded
> mode but terminate without an indefinite wait would be acceptable to
> me, provided that it offers some advantage in safety and security over
> what I've already got.

I think the above example, where we only grant the locks required for a
specific thing and only at the relevant severity, would already be a big
improvement. Even better would be to use the computed set of locks to
check whether the workers could acquire them, and to refuse parallelism
if they couldn't.

Another thing to do is to mark the lock, both in the master and the
workers, as not effectively being of the original severity anymore. If
the process itself then tries to acquire the lock at a heavier severity
than what was needed at the time of query execution, error out - at
least until parallel mode is confirmed to have ended. That should pretty
much never happen.
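
Very roughly, and with entirely made-up field names (so just an
illustration of the idea, not actual lock.c code):

    /* hypothetical sketch somewhere in the lock acquisition path */
    if (IsInParallelMode() &&
        locallock->sharedWithWorkers &&              /* made-up flag */
        lockmode > locallock->modeAtParallelStart)   /* made-up field */
        elog(ERROR, "cannot upgrade a lock that was shared with parallel workers");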

I don't see how we can avoid teaching the deadlock detector about all
these silent deadlocks in any case. By your own reasoning.

> A proposal to make parallel_count error out
> in the above case would not be acceptable to me; the planner must not
> generate parallel plans that will sometimes fail unexpectedly at
> execution-time. I generally believe that we will be much happier if
> application programmers need not worry about the failure of parallel
> workers to obtain locks already held by the master; some such failure
> modes may be very complex and hard to predict. The fact that the
> current approach handles the problem entirely within the lock manager,
> combined with the fact that it is extremely simple, is therefore very
> appealing to me.

It's trivial to be simple and unsafe...

> Nonetheless, a test case that demonstrates this approach falling down
> badly would force a rethink; do you have one? Or an idea about what
> it might look like?

No, I don't have one handy. I have the very strong gut feeling that this
would introduce a severe architectural debt that we won't be able to get
rid of afterwards, even if it proves to be a really bad idea. To me
it's taking a shortcut directly through the hard problems.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
