pg_listener entries deleted under heavy NOTIFY load only on Windows

From: Radu Ilie <rilie(at)wsi(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2010-01-22 21:30:26
Message-ID: 1264195826.2328.159.camel@arc-dev2.wsicorp.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On a Windows server under heavy load of NOTIFY events, entries in
pg_listener table for some events are deleted. It is like UNLISTEN was
called.

PostgreSQL version: 8.3.9.
Operating System: Windows XP.

PostgreSQL believes that if it fails to notify a listener (by signaling
the respective backend), then the backend doesn't exist anymore and so
it should get rid of the pg_listener entry. The relevant code is in
src/backend/commands/async.c, function Send_Notify:

if (kill(listenerPID, SIGUSR2) < 0)
{
/*
* Get rid of pg_listener entry if it refers to a PID that no
* longer exists. Presumably, that backend crashed without
* deleting its pg_listener entries. This code used to only
* delete the entry if errno==ESRCH, but as far as I can see
* we should just do it for any failure (certainly at least
* for EPERM too...)
*/
simple_heap_delete(lRel, &lTuple->t_self);
}

The problem is that under Windows, kill can fail even if the process is
still alive. PostgreSQL uses named pipes under Windows to send signals
to backends. The present implementation has a bug that causes a client
to fail to write data to the named pipe, even though the server process
is alive. This is because the server doesn't maintain the named pipe at
all times. The bug is present in file src/backend/port/win32/signal.c,
function pg_signal_thread.

The server code stays in a loop in which it continuously creates an
instance of the named pipe (via CreateNamedPipe) and waits for a client
process to connect (via ConnectNamedPipe). Once a client connects, the
communication with the client is handled in a new thread, with the
thread procedure pg_signal_dispatch_thread. This function is very
simple: it reads one byte from the named pipe, it writes it back and
(very important) closes the handle to the named pipe instance. The main
loop creates another instance of the named pipe and waits for another
client to connect.

Now imagine that the server is under heavy load. There are dozens of
backends and threads running and the CPU usage is close to 100%. The
following succession of events is possible:

1. Server signal handling thread (in function pg_signal_thread) creates
the first, one and only instance of the named pipe via CreateNamedPipe.
2. Server code starts waiting for clients to connect with
ConnectNamedPipe.
3. Client wishes to make a transaction on the named pipe and calls
CallNamedPipe (in file src/port/kill.c, function pgkill).
4. Server code returns from ConnectNamedPipe. It creates a new thread
with the thread procedure pg_signal_dispatch_thread.
5. The signal dispatch thread is scheduled for execution and it runs to
completion. As you can see, the last thing it does related to the named
pipe is to close the handle via CloseHandle (in function
pg_signal_dispatch_thread). This closes the last instance of the named
pipe. The named pipe is gone. There is no more named pipe. The signal
handling thread was not yet scheduled by the operating system for
execution and thus didn't have an opportunity to call CreateNamedPipe.
6. Another client (or the same one, it doesn't matter) tries to write to
the named pipe via CallNamedPipe. The call returns ERROR_FILE_NOT_FOUND,
because the named pipe is gone. The client believes the backend is gone
and it removes the entry from pg_listener.
7. The signal handling thread (in function pg_signal_thread) is finally
scheduled for execution and it calls CreateNamedPipe. We now have an
instance of the named pipe available.

So we end up with the server backend alive, the named pipe is there, but
the row is gone from pg_listener. This is easy to reproduce under
Windows. I used the scripts posted by Steve Marshall in a similar thread
from 01/15/2009 and the problem appears within one minute all the time.
For testing I used a Windows XP machine with 2 cores and 2GB of RAM. The
CPU usage was over 70% during the trials.

The solution is to create a new instance of the named pipe before
launching the signal dispatch thread. This means changing the code in
src/backend/port/win32/signal.c to look like this:

@@ -250,6 +250,7 @@
{
char pipename[128];
HANDLE pipe = pgwin32_initial_signal_pipe;
+ HANDLE new_pipe = pgwin32_initial_signal_pipe;

snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u",
GetCurrentProcessId());

@@ -275,6 +276,10 @@
fConnected = ConnectNamedPipe(pipe, NULL) ? TRUE :
(GetLastError() == ERROR_PIPE_CONNECTED);
if (fConnected)
{
+ new_pipe = CreateNamedPipe(pipename,
PIPE_ACCESS_DUPLEX,
+ PIPE_TYPE_MESSAGE |
PIPE_READMODE_MESSAGE | PIPE_WAIT,
+
PIPE_UNLIMITED_INSTANCES, 16, 16, 1000, NULL);
+
hThread = CreateThread(NULL, 0,

(LPTHREAD_START_ROUTINE) pg_signal_dispatch_thread,

(LPVOID) pipe, 0, NULL);
@@ -288,8 +293,7 @@
/* Connection failed. Cleanup and try again */
CloseHandle(pipe);

- /* Set up so we create a new pipe on next loop */
- pipe = INVALID_HANDLE_VALUE;
+ pipe = new_pipe;
}
return 0;
}

This will guarantee that we have an instance of the named pipe available
at any given moment. If we do this, we can also remove the 3 tries loop
from src/port/kill.c:

@@ -25,7 +25,6 @@
BYTE sigData = sig;
BYTE sigRet = 0;
DWORD bytes;
- int pipe_tries;

/* we allow signal 0 here, but it will be ignored in
pg_queue_signal */
if (sig >= PG_SIGNAL_COUNT || sig < 0)
@@ -41,14 +40,6 @@
}
snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u",
pid);

- /*
- * Writing data to the named pipe can fail for transient
reasons.
- * Therefore, it is useful to retry if it fails. The maximum
number of
- * calls to make was empirically determined from a 90-hour
notification
- * stress test.
- */
- for (pipe_tries = 0; pipe_tries < 3; pipe_tries++)
- {
if (CallNamedPipe(pipename, &sigData, 1, &sigRet, 1,
&bytes, 1000))
{
if (bytes != 1 || sigRet != sig)
@@ -58,7 +49,6 @@
}
return 0;
}
- }

if (GetLastError() == ERROR_FILE_NOT_FOUND)
errno = ESRCH;

As a note, the original code has a timeout of 1000 milliseconds
specified in the call to CallNamedPipe in kill.c. That timeout doesn't
help for this bug, because it is the timeout that the client will wait
for an instance of the named pipe to be available for communication.
This means it helps in the case when we have one client already using
that instance to read/write to the server and this second client will
wait until the first one finishes. It does not help if there is no
instance at all. In that case the CallNamedPipe returns immediately with
ERROR_FILE_NOT_FOUND.

Thanks,
Radu

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2010-01-22 22:15:55 Re: commit fests
Previous Message Kevin Grittner 2010-01-22 21:13:30 Re: Largeobject Access Controls (r2460)