"A.M." <agentm(at)themactionfaction(dot)com> writes:
> On 06/26/2012 07:30 PM, Tom Lane wrote:
>>> I solved this via fcntl locking.
>> No, you didn't, because fcntl locks aren't inherited by child processes.
>> Too bad, because they'd be a great solution otherwise.
> You claimed this last time and I replied:
> "I address this race condition by ensuring that a lock-holding violator
> is the postmaster or a postmaster child. If such as condition is
> detected, the child exits immediately without touching the shared
> memory. POSIX shmem is inherited via file descriptors."
> This is possible because the locking API allows one to request which PID
> violates the lock. The child expects the lock to be held and checks that
> the PID is the parent. If the lock is not held, that means that the
> postmaster is dead, so the child exits immediately.
OK, I went back and re-read the original patch, and I now agree that
something like this is possible --- but I don't like the way you did
it. The dependence on particular PIDs seems both unnecessary and risky.
The key concept here seems to be that the postmaster first stakes a
claim on the data directory by exclusive-locking a lock file. If
successful, it reduces that lock to shared mode (which can be done
atomically, according to the SUS fcntl specification), and then holds
the shared lock until it exits. Spawned children will not initially
have a lock, but what they can do is attempt to acquire shared lock on
the lock file. If fail, exit. If successful, *check to see that the
parent postmaster is still alive* (ie, getppid() != 1). If so, the
parent must have been continuously holding the lock, and the child has
successfully joined the pool of shared lock holders. Otherwise bail
out without having changed anything. It is the "parent is still alive"
check, not any test on individual PIDs, that makes this work.
There are two concrete reasons why I don't care for the
GetPIDHoldingLock() way. Firstly, the fact that you can get a blocking
PID from F_GETLK isn't an essential part of the concept of file locking
IMO --- it's just an incidental part of this particular API. May I
remind you that the reason we're stuck on SysV shmem in the first place
is that we decided to depend on an incidental part of that API, namely
nattch? I would like to not require file locking to have any semantics
more specific than "a process can hold an exclusive or a shared lock on
a file, which is auto-released at process exit". Secondly, in an NFS
world I don't believe that the returned l_pid value can be trusted for
anything. If it's a PID from a different machine then it might
accidentally conflict with one on our machine, or not.
Reflecting on this further, it seems to me that the main remaining
failure modes are (1) file locking doesn't work, or (2) idiot DBA
manually removes the lock file. Both of these could be ameliorated
with some refinements to the basic idea. For (1), I suggest that
we tweak the startup process (only) to attempt to acquire exclusive lock
on the lock file. If it succeeds, we know that file locking is broken,
and we can complain. (This wouldn't help for cases where cross-machine
locking is broken, but I see no practical way to detect that.)
For (2), the problem really is that the proposed patch conflates the PID
file with the lock file, but people are conditioned to think that PID
files are removable. I suggest that we create a separate, permanently
present file that serves only as the lock file and doesn't ever get
modified (it need have no content other than the string "Don't remove
this!"). It'd be created by initdb, not by individual postmaster runs;
indeed the postmaster should fail if it doesn't find the lock file
already present. The postmaster PID file should still exist with its
current contents, but it would serve mostly as documentation and as
server-contact information for pg_ctl; it would not be part of the data
directory locking mechanism.
I wonder whether this design can be adapted to Windows? IIRC we do
not have a bulletproof data directory lock scheme for Windows.
It seems like this makes few enough demands on the lock mechanism
that there ought to be suitable primitives available there too.
regards, tom lane
In response to
pgsql-hackers by date
|Next:||From: David Fetter||Date: 2012-06-27 02:11:12|
|Subject: Re: Catalog/Metadata consistency during changeset
extraction from wal|
|Previous:||From: Robert Haas||Date: 2012-06-27 00:44:46|
|Subject: Re: Posix Shared Mem patch|