From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_promote not marked as parallel-restricted in pg_proc.dat
Date: 2018-11-01 16:59:47
Message-ID: CA+Tgmoa3uMW0jzv9V_d8-8n6LKvPeXCOgRx_BQeR2HvamMUEGA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 31, 2018 at 8:43 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> Okay, but likely we would not want to signal the postmaster
> unnecessarily, no? FALLBACK_PROMOTE_SIGNAL_FILE gets discarded if
> promotion is triggered more than once, but that does not seem like a
> sane thing to do if not necessary.

Uh, what in the world does that have to do with the topic at hand?
Parallel-safety has nothing to do with preventing functions from being
called "unnecessarily" or "more than once".

I don't like the fact that Tom keeps arguing for marking things
parallel-unsafe without any real justification. During the
development of parallel query, he argued that no such concept should
exist, because everything should work in parallel mode just like it
does otherwise. I eventually convinced him (or so it seemed) that we
had to have the concept because it was impractical to eliminate all
the corner cases where things were parallel-unsafe in the short/medium
term. However, in the long term, the fact that we have
parallel-restricted and parallel-unsafe is an implementation
restriction that we should be trying to eliminate. Making the list of
parallel-unsafe things longer makes that goal harder to achieve,
especially when the marking is applied not out of any real
justification based on what the code does, as was my intent, but based
on an ill-defined feeling of unease.

Eventually, after some future round of infrastructure improvements,
we're going to end up having discussions about changing things that
are marked parallel-unsafe or parallel-restricted to parallel-safe,
and when somebody can't find any reason why the existing markings are
like that in the first place, they're going to use that as
justification for not changing them ("there must've been a reason!").
Parallel-safety shouldn't be some kind of quasi-religious thing where
markings bleed from a function that is properly marked to one with a
similar name, or irrational values are applied in the name of
preventing unspecified or only-vaguely-connected evils. It should be
based on things that can be determined scientifically, like "hey, does
this code access any global variables other than those which are
automatically synchronized between the master and the workers?" and
"hey, does this code ever attempt
heap_insert/heap_update/heap_delete?" and "hey, does this ever call
other user-defined code that might itself be parallel-unsafe?".

The right way to figure out the appropriate parallel-safety marking
for a function is to read the code and see what it does. Now you
might miss something, as with any code-reading exercise, but if you
don't, then you should come up with the right answer. In the case of
pg_promote, it calls out to the following functions:

RecoveryInProgress - checks shared memory state; equally available to
a worker as to the leader.
AllocateFile/FreeFile - no problem here; these work fine in workers,
just like in any other backend.
kill, unlink - same thing.
ResetLatch, WaitLatch, CHECK_FOR_INTERRUPTS() - all fine in a worker;
in fact, used as part of the worker machinery.
ereport - works anywhere; the worker machinery propagates errors back
to the leader.

That's all. There are no obvious references to global variables or
anything here. So, it looks to be parallel-safe.
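
For reference, the marking itself lives in the function's entry in
src/include/catalog/pg_proc.dat, where proparallel is 's' (safe),
'r' (restricted), or 'u' (unsafe). A parallel-safe entry for
pg_promote would look roughly like this (field values other than
proname and proparallel are illustrative, not copied from the actual
catalog):

```
{ oid => '...', descr => 'promote standby server',
  proname => 'pg_promote', provolatile => 'v', proparallel => 's',
  prorettype => 'bool', proargtypes => 'bool int4',
  prosrc => 'pg_promote' },
```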

If you do the same analysis for pg_start_backup(), you'll immediately
notice that it calls get_backup_status(), and if you look at that
function, you'll see that it returns the value of a global variable.
If you check parallel.c, you'll find that that global variable is not
one of the special ones that gets synchronized between a leader and
the workers. Therefore, if you ran this function in a worker, it
might do the wrong thing, because it might get the wrong value of that
global variable. So it's at least (and in fact exactly)
parallel-restricted. It doesn't need to be parallel-restricted
because it's scary in some ill-defined way or any of these other
reasons advanced on this thread -- it needs to be parallel-restricted
because it *relies on a global variable that is not synchronized to
parallel workers*. If somebody writes code to move that state out of
a global variable and into the main shared memory segment or to a DSM
shared between the leader and the workers, then it can be marked
parallel-safe (unless, for example, it also depends on OTHER global
variables that are NOT moved into shared storage). Otherwise not.

I hope I'm making sense here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
