I've progressed a bit farther on the dropped notification bug, which amends my previous results.

I modified the error handling code for the kill() call in async.c to check to see if the listenerPID is associated with a running backend process.  To do that, I call IsBackendPid(listenerPID).  I only remove the pg_listener entry if the IsBackendPid returns false.  If the proces exists, I execute the same pg_listener update logic as if kill succeeded. I found this eliminates the problem completely without, altering the timeout in CallNamedPipe.

I also refined my debug statements in pgkill, and determined the system error was different from what I had previously reported.  The GetLastError() value is 31 (ERROR_GEN_FAILURE), not 2 (ERROR_FILE_NOT_FOUND).  ERROR_GEN_FAILURE is translated as "A device attached to the system is not functioning", which sounds at least a bit more like the "pipe is full" or "pipe not available" error I was expecting.

Essentially, the new error handling in async.c allows postgres to fail in its efforts to signal a process about the presence of a NOTIFY event without invalidating the subscription (LISTEN) for that event.  The modified logic still records the presence of an event in pg_listener, if this event has not already been recorded.  I believe this means the listener will pick up this event the next time it is signalled.  AFAICS, this means the event notification won't be dropped, and assuming a heavy load of notifications to the listener, the notification won't even be seriously delayed.

I think the details of the problem are reasonably well known, but we have some choices on implementation of a fix.  As I see it, the fix should be localized in the error handling of the kill command in async.c.  We could use the technique I described, explicitly checking for the listener process with IsBackendPid (patch do this is attached).  Or we could revert to a previous technique used in this code: checking the errno from kill, and only deleting the pg_listener record under certain error conditions.   In that case, we would need to ensure ERROR_GEN_FAILURE is translated to an appropriate errno value.

I think detecting whether or not the listener backend exists is really the check we want.  However, there may be performance considerations to different ways of checking this.

Any thoughts on this or other possible solutions?

Steve

Marshall, Steve wrote:
As I posted before, changing the timeout from 1000 to
NMPWAIT_WAIT_FOREVER fixed the problem, or at least moved it such it
does not occur easily anymore.

To better understand the problem, I added debugging as Tom suggested.  I
restored timeout on CalledNamedPipe 1000 ms, and reran my tests.
Indeed, kill is encountering an error:
LOG:  kill(2168) failed: No such process

I instrumented pgkill to output the value of GetLastError() if
CalledNamedPipe fails.  It returned error code 2, which Windows
identifies as ERROR_FILE_NOT_FOUND.  The logic in pgkill translates this
Windows error into an errno value of ESRCH.

The Windows error is a bit surprising, at least to me -- I expected
something indicating the pipe was full. Does anyone have a richer
interpretation of this error?

Thanks,
Steve

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 

Alvaro Herrera <alvherre@commandprompt.com> writes:
  
Marshall, Steve wrote:
    
Any thoughts on how to confirm or deny Theory A?
      

  
Try changing the 1000 to NMPWAIT_WAIT_FOREVER
    


As long as you're changing the source code, it'd be a good idea to
verify the supposition that kill() is failing, eg in
src/backend/commands/async.c

			if (kill(listenerPID, SIGUSR2) < 0)
			{
+				elog(LOG, "kill(%d) failed: %m",
listenerPID);
				/*
				 * Get rid of pg_listener entry if it
refers to a PID that no
				 * longer exists.  Presumably, that
backend crashed without
				 * deleting its pg_listener entries.
This code used to only


If that's right, sprinkling a few debug printouts into src/port/kill.c
would be the next step.

			regards, tom lane