Re: pg_listener entries deleted under heavy NOTIFY load only on Windows

From: "Marshall, Steve" <smarshall(at)wsi(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: pg_listener entries deleted under heavy NOTIFY load only on Windows
Date: 2009-01-30 14:01:09
Message-ID: 49830825.1000408@wsi.com
Lists: pgsql-bugs

Tom Lane wrote:

>So my opinion is that the real issue here is why is the kill()
>implementation failing when it should not. We need to fix that,
>not put band-aids in async.c.
>
>
I think Tom makes a valid point. To that end, I think the Windows kill
implementation needs to be changed to have these properties:
1. kill should generally return quickly and should never hang
2. kill should never fail for a transient reason
3. if kill fails, it should be a good indication that the process is
dead or we are permanently unable to communicate with it.

The current implementation has property #1, but not #2 or #3. However,
I think I've figured out a simple way to modify the Windows
implementation of pgkill to achieve all three properties. I will do
some long-term testing of the changes over the weekend, just to check
that my solution works properly over longer time scales.

Here is what I've found so far:
* Contrary to my previous reports, the notification error is always the
result of pgkill failing with error code 2 (ERROR_FILE_NOT_FOUND). I
had previously thought it had issued error 31, but this was just an
error in my debug message (the signal was 31, i.e. SIGUSR2).
* Also contrary to what I wrote previously, long or infinite timeouts do
not fix the problem. With an infinite timeout, I've avoided the problem
for as long as ten minutes, but it eventually happens. In some cases,
the problem even occurred quickly with an infinite timeout.

The solution that seems to work is to call CallNamedPipe repeatedly in a
loop if it fails. Currently, I call the function up to a maximum of 5
times, although in all test cases so far, the code has never needed more
than 1 retry to succeed. Based on my testing, I may reduce the maximum
number of retries. The code sleeps for 20 ms between retries. I
reduced the timeout for CallNamedPipe from 1000 ms to 250 ms after the
first call, to reduce the total time for the signal if we hit a case
that needs to time out.
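
A minimal sketch of the retry scheme described above, not the actual
patch. It assumes "up to a maximum of 5 times" means 5 attempts in
total. The callback type, send_with_retry, and the flaky_attempt stub
are hypothetical names; on Windows the callback would wrap the real
CallNamedPipe call, and the stub only stands in for it so the loop can
be exercised elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values taken from the description above; MAX_ATTEMPTS counts total calls (assumption). */
enum {
    MAX_ATTEMPTS     = 5,
    FIRST_TIMEOUT_MS = 1000,
    RETRY_TIMEOUT_MS = 250,
    RETRY_SLEEP_MS   = 20
};

/* Hypothetical callback: one delivery attempt with the given timeout.
 * On Windows this would wrap CallNamedPipe(); returns true on success. */
typedef bool (*send_attempt_fn)(void *ctx, unsigned timeout_ms);

/* Hypothetical callback: sleep for the given number of milliseconds. */
typedef void (*sleep_fn)(unsigned ms);

/* Try the attempt up to MAX_ATTEMPTS times. Returns the number of
 * attempts used on success, or -1 if every attempt failed. */
static int send_with_retry(send_attempt_fn attempt, void *ctx, sleep_fn pause)
{
    assert(attempt != NULL);

    for (int i = 0; i < MAX_ATTEMPTS; i++) {
        /* Full timeout on the first call, shorter one on retries. */
        unsigned timeout_ms = (i == 0) ? FIRST_TIMEOUT_MS : RETRY_TIMEOUT_MS;

        if (attempt(ctx, timeout_ms))
            return i + 1;

        /* Sleep between attempts, but not after the last one. */
        if (i + 1 < MAX_ATTEMPTS && pause != NULL)
            pause(RETRY_SLEEP_MS);
    }
    return -1;
}

/* Test stand-in for the pipe call: fails a fixed number of times, then succeeds. */
struct flaky_state {
    int      failures_left;
    unsigned last_timeout_ms;
};

static bool flaky_attempt(void *ctx, unsigned timeout_ms)
{
    struct flaky_state *st = ctx;

    st->last_timeout_ms = timeout_ms;
    if (st->failures_left > 0) {
        st->failures_left--;
        return false;
    }
    return true;
}
```

With these numbers, a signal to a process that is really gone costs at
most 1000 + 4 * 250 ms of timeouts plus 4 * 20 ms of sleeps, so the
caller still gets a failure back in bounded time.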

I also notice that signals that should fail do fail. For example,
signal 30 seems to be regularly sent to pid 1, and this fails in both
the original code and my modified version.

I don't have a firm explanation for why CallNamedPipe occasionally
fails but then succeeds when called with the same arguments a very
short while later. It's hard to know without being able to see the
source code. However, from the Windows documentation, it seems like a
single named pipe in Windows can have several "instances", which seem to
be access interfaces to the pipe. I suspect there is some race
condition where the code erroneously decides it needs to create an
"instance" of the pipe, rather than waiting for an instance to become
available. When the instance creation fails, it generates the
ERROR_FILE_NOT_FOUND error.

I'll post back on Monday with more complete test results, and, if all
goes well, a patch. If anyone has ideas on what else should be tested,
please let me know.

Steve
