There is no firewall in place; more precisely, the Windows Firewall is configured to be off, and no other firewall is installed on the system.
To get to this point in the code, WSARecv() must have failed with WSAEWOULDBLOCK. The socket is a datagram socket set for overlapped IO. The MSDN documentation says that, for an overlapped socket, this error means there are too many outstanding overlapped IO requests. I don't know if "too many" applies to just this socket or to the system as a whole, and the documentation isn't clear about how to handle the error in this situation.
We don't need to know whether this is a kernel issue, a bug in Winsock, or an undocumented behaviour. Regardless, it can be treated as a fault.
Knowing that it is possible for WaitForMultipleObjectsEx to lock up means that it is not safe to call with an INFINITE timeout. The workaround that's being discussed is beginning to look like the one at line 172 of socket.c. It's bad enough that there is a WSASend in pgwin32_waitforsinglesocket(). I doubt you also want to add a WSARecv. There should be a cleaner way to handle both of these situations.
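To make the bounded-timeout idea concrete, here is a minimal, portable sketch of the loop shape being discussed. It is not the code in socket.c and it does not call Winsock: fake_socket, fake_recv(), fake_wait() and recv_with_bounded_wait() are all hypothetical names that stand in for the real socket, WSARecv(), and WaitForMultipleObjectsEx(). The simulation assumes the event is never signaled even though data arrives, which is the situation described in this bug.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the Win32 results discussed in this thread:
 * a signaled event vs. a wait timeout, and success vs. WSAEWOULDBLOCK. */
typedef enum { EVT_SIGNALED, EVT_TIMED_OUT } wait_result;
typedef enum { RECV_OK, RECV_WOULD_BLOCK } recv_result;

/* Simulated socket: a datagram becomes readable after `ready_after`
 * receive attempts, but the event may never be signaled. */
typedef struct
{
    int polls;          /* number of receive attempts so far */
    int ready_after;    /* attempts that fail before data is available */
    int event_signaled; /* 0 simulates the lost wakeup */
} fake_socket;

/* Stand-in for WSARecv(): reports "would block" until data is ready. */
static recv_result fake_recv(fake_socket *s)
{
    s->polls++;
    return (s->polls > s->ready_after) ? RECV_OK : RECV_WOULD_BLOCK;
}

/* Stand-in for a wait with a finite timeout; the simulation returns
 * immediately instead of actually sleeping for timeout_ms. */
static wait_result fake_wait(fake_socket *s, unsigned timeout_ms)
{
    (void) timeout_ms;
    return s->event_signaled ? EVT_SIGNALED : EVT_TIMED_OUT;
}

/* Retry forever, but never block forever in a single wait: each wait is
 * bounded, and a timeout simply leads to another receive attempt.
 * Returns the number of waits that timed out before the receive succeeded. */
static int recv_with_bounded_wait(fake_socket *s, unsigned timeout_ms)
{
    int timeouts = 0;

    assert(s != NULL);
    for (;;)
    {
        if (fake_recv(s) == RECV_OK)
            return timeouts;
        if (fake_wait(s, timeout_ms) == EVT_TIMED_OUT)
            timeouts++;
    }
}
```

With a socket set up as `fake_socket s = {0, 3, 0};`, the loop still completes even though the event is never signaled, because every wait returns after its timeout and the receive is retried. With an infinite wait in the same place, the first wait would never return.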
I am planning to eventually kill the stats collector and see if that clears up the hanging issue, but I want to keep the system state in place for a bit longer in case there are other diagnostic steps I should try. I've exhausted everything I could think of.
From: Nikhil Sontakke [mailto:nikhil(dot)sontakke(at)enterprisedb(dot)com]
Sent: Monday, August 03, 2009 10:38 AM
To: Magnus Hagander
Cc: Alvaro Herrera; Luke Koops; pgsql-bugs(at)postgresql(dot)org
Subject: Re: [BUGS] BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
>>>>> Maybe. I'm unsure if it's enough to just try another
>>>>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>>>>> WSARecv() on it as well. Maybe it would be enough to just change
>>>>> the INFINITE on line 318 of socket.c to a fixed value. That will
>>>>> loop down to WSARecv() which should exit with WSAEWOULDBLOCK which
>>>>> will cause us to do a short sleep and come back. But we'd have to
>>>>> change the limit of 5 somehow then, since in theory we should wait
>>>>> forever. Maybe that outer loop should just be a for(;;), what do you think?
>>>> Yes, line 318 seems to be a much better location to me. If Windows
>>>> and its socket logic behave properly most of the time :), most
>>>> of the calls should come out within the first few tries, so
>>>> changing 5 to an infinite loop shouldn't hurt those normal use cases in theory.
>>>> OTOH, I was wondering what if we kill the stats collector and on a
>>>> restart the socket communication resumes properly. Would that
>>>> conclusively mean that it is a flaw in our code?
>>> No, if we kill the stats collector that will destroy all sockets,
>>> and when the new one starts all the sockets it operates on are fresh
>>> and new. So it doesn't show that the flaw is in our code - but it
>>> also doesn't show that it's in the kernel or runtime libraries.
>> AFAICS in the code, the inherited pgStatSock socket FD remains the
>> same across the restart of the stats collector process...
> Partially correct, I think.
> Each backend has its own handle on win32, since we use EXEC_BACKEND
> (this includes the "utility processes" like the stats collector). When
> we start the new one, we are going to use DuplicateHandle() in
> save_backend_variables(). This will therefore get it a new handle, but
> they are both pointing to the same kernel object. I don't know if
> WaitForMultipleObjectsEx() is going to see these as two different
> objects or not, but I think it does.
Hmm, got it. Nothing like adding more confusion into the mix :)