Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-30 14:01:09
Message-ID: 49830825.1000408@wsi.com
Lists: pgsql-bugs
Tom Lane wrote:

>So my opinion is that the real issue here is why is the kill()
>implementation failing when it should not.  We need to fix that,
>not put band-aids in async.c.
I think Tom makes a valid point.  To that end, I think the Windows kill 
implementation needs to be changed to have these properties:
1. kill should generally return quickly and should never hang
2. kill should never fail for a transient reason
3. if kill fails, it should be a good indication that the process is 
dead or we are permanently unable to communicate with it.

The current implementation has property #1, but not #2 or #3.  However, 
I think I've figured out a simple way to modify the Windows 
implementation of pgkill to achieve all three properties.  I will run 
longer tests of the changes over the weekend, to check that my 
solution holds up over longer time scales.

Here is what I've found so far:
* Contrary to my previous reports, the notification error is always the 
result of pgkill failing with error code 2 (ERROR_FILE_NOT_FOUND).  I 
had previously thought it had issued error 31, but this was just an 
error in my debug message (the signal was 31, i.e. SIGUSR2).
* Also contrary to what I wrote previously, long or infinite timeouts do 
not fix the problem.  With an infinite timeout, I've avoided the problem 
for as long as ten minutes, but it eventually happens.  In some cases, 
the problem even occurred quickly with an infinite timeout.

The solution that seems to work is to call CallNamedPipe again in a 
loop if it fails.  Currently, I call the function up to a maximum of 5 
times, although in all test cases so far, the code has never needed more 
than 1 retry to succeed.  Depending on my test results, I may reduce the 
maximum number of retries.  The code sleeps for 20 ms between retries.  
The first call keeps the 1000 ms timeout for CallNamedPipe; I reduced 
the timeout to 250 ms for the calls after it, to limit the total time 
spent on the signal if we hit a case that needs to time out.
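
To make the loop concrete, here is a rough sketch of the retry logic 
described above.  It is not the actual patch.  The pipe call is hidden 
behind a hypothetical callback (send_attempt_fn) so the sketch compiles 
anywhere; in pgkill the callback would wrap CallNamedPipe.  All of the 
names are made up for illustration, and I am assuming "a maximum of 5 
times" means 5 calls in total.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values taken from the description above. */
#define FIRST_TIMEOUT_MS 1000   /* timeout for the first call */
#define RETRY_TIMEOUT_MS 250    /* shorter timeout for later calls */
#define RETRY_SLEEP_MS   20     /* pause between calls */
#define MAX_ATTEMPTS     5      /* assumed to mean total calls */

/* Hypothetical callback: one attempt to deliver the signal.
 * On Windows this would wrap CallNamedPipe() with the given timeout. */
typedef bool (*send_attempt_fn) (void *ctx, unsigned timeout_ms);

/* Hypothetical callback: sleep for ms milliseconds (may be NULL). */
typedef void (*sleep_fn) (unsigned ms);

/* Returns the number of calls used on success, or 0 if all failed. */
static int
send_with_retry(send_attempt_fn attempt, void *ctx,
                sleep_fn do_sleep, int max_attempts)
{
    assert(attempt != NULL && max_attempts > 0);

    for (int i = 0; i < max_attempts; i++)
    {
        unsigned timeout = (i == 0) ? FIRST_TIMEOUT_MS : RETRY_TIMEOUT_MS;

        if (attempt(ctx, timeout))
            return i + 1;

        /* Do not sleep after the final failed call. */
        if (i + 1 < max_attempts && do_sleep != NULL)
            do_sleep(RETRY_SLEEP_MS);
    }
    return 0;
}

/* Stand-in for the real pipe: fails failures_left times, then succeeds. */
struct flaky_pipe
{
    int      failures_left;
    int      calls;
    unsigned last_timeout_ms;
};

static bool
flaky_attempt(void *ctx, unsigned timeout_ms)
{
    struct flaky_pipe *p = ctx;

    p->calls++;
    p->last_timeout_ms = timeout_ms;
    if (p->failures_left > 0)
    {
        p->failures_left--;
        return false;
    }
    return true;
}
```

With these assumed values, a signal to a process that is really gone 
costs at most about 1000 + 4 * (20 + 250) = 2080 ms if every call runs 
to its timeout, while the single-retry case seen in testing adds little 
more than the 20 ms sleep.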

I also notice that signals that should fail do fail.  For example, 
signal 30 seems to be regularly sent to pid 1, and this fails in both 
the original code and my modified version.

I'm not entirely sure why CallNamedPipe fails occasionally, yet 
succeeds when called with the same arguments a very short while later.  
It's hard to know without being able to see the source code.  However, 
from the Windows documentation, it seems that a single named pipe in 
Windows can have several "instances", which seem to be access 
interfaces to the pipe.  I suspect there is some race condition where 
the code erroneously decides it needs to create an "instance" of the 
pipe, rather than waiting for an instance to become available.  When 
the instance creation fails, it generates the ERROR_FILE_NOT_FOUND 
error.

I'll post back on Monday with more complete test results, and, if all 
goes well, a patch.  If anyone has ideas on what else should be tested, 
please let me know.

Steve