I've been trying to figure out why my recent attempt to latch-ify the
stats collector didn't work well on the Windows buildfarm machines.
After a good deal of staring at our code and Microsoft's documentation
I have a theory, which I intend to try out shortly. However, it appears
to me that this code is riddled with problems that are going to take a lot
of work to solve properly. I have neither the interest nor the Windows
development environment to do more than getting the pgstats code to work,
so somebody else is going to have to step up here.
The first issue is that the patched stats collector code expected recv()
on a non-blocking socket to fail with EWOULDBLOCK or a related error code.
pgwin32_recv() gets this exactly backwards, and instead waits for the
socket to come ready using pgwin32_waitforsinglesocket(). I initially
thought this was broken by commit 215cbc90f8db3fc8a70af5572b156f49216c2f70
which introduced the "pgwin32_noblock" hack, but on checking the git
history I think it was wrong before then too.
This wouldn't actually be fatal in itself, as long as the recv call exits
with EINTR on receipt of a signal, which it does; it doesn't much matter
whether the stats collector waits at the recv() or the WaitLatchOrSocket()
call when it has nothing to do.
However, pgwin32_waitforsinglesocket creates an event object and attaches
it to the socket, and leaves things that way on exit. Assuming that we
do sometimes get to the WaitLatchOrSocket() call, that code creates a
different event object and attaches it to the socket. The WSAEventSelect
warns against this, saying
Issuing a WSAEventSelect for a socket cancels any previous
WSAAsyncSelect or WSAEventSelect for the same socket and clears the
internal network event record. For example, to associate an event object
with both reading and writing network events, the application must call
WSAEventSelect with both FD_READ and FD_WRITE, as follows:
rc = WSAEventSelect(s, hEventObject, FD_READ|FD_WRITE);
It is not possible to specify different event objects for different
network events. The following code will not work; the second call will
cancel the effects of the first, and only the FD_WRITE network event
will be associated with hEventObject2:
rc = WSAEventSelect(s, hEventObject1, FD_READ);
rc = WSAEventSelect(s, hEventObject2, FD_WRITE); //bad
So the socket's attachment is getting moved back and forth between two
different event objects. What I suspect is happening is that an incoming
FD_READ event is getting "lost" when it happens concurrently with such a
switch; probably the wrong one of the two event objects gets marked
signaled, and then we just sit and wait. The phraseology about "clearing
the internal network event record" is pretty scary too, even though
elsewhere on the page it's claimed that FD_READ is level-triggered and
so should be re-asserted against a newly selected event object. Another
point is that googling suggests that it's unwise to ResetEvent an event
that's attached to a socket, which is what will happen during repeated
calls to pgwin32_waitforsinglesocket.
What I intend to try doing about this is making
pgwin32_waitforsinglesocket detach its event object from the socket before
returning, ie WSAEventSelect(socket, NULL, 0). This will hopefully
decouple it a bit better from WaitLatchOrSocket.
Other interesting things I learned from the WSAEventSelect man page:
The WSAEventSelect function automatically sets socket s to nonblocking
mode, regardless of the value of lNetworkEvents. To set socket s back to
blocking mode, it is first necessary to clear the event record
associated with socket s via a call to WSAEventSelect with
lNetworkEvents set to zero and the hEventObject parameter set to
NULL. You can then call ioctlsocket or WSAIoctl to set the socket back
to blocking mode.
I imagine this is the reason for the otherwise seemingly brain-dead
decision that pgwin32_recv() should act like non-blocking sockets are
blocking. However, it's still extremely distressing that code that
*wants* to operate a socket in non-blocking mode is not able to do so.
This ought to be fixed, and not with a global flag like pgwin32_noblock
--- it needs to be per-socket.
Also, although it's stated that WSAEventSelect will generally set the
event to signaled state when any requested network event has already
happened, we read this:
The FD_WRITE network event is handled slightly differently. An FD_WRITE
network event is recorded when a socket is first connected with a call
to the connect, ConnectEx, WSAConnect, WSAConnectByList, or
WSAConnectByName function or when a socket is accepted with accept,
AcceptEx, or WSAAccept function and then after a send fails with
WSAEWOULDBLOCK and buffer space becomes available. Therefore, an
application can assume that sends are possible starting from the first
FD_WRITE network event setting and lasting until a send returns
WSAEWOULDBLOCK. After such a failure the application will find out that
sends are again possible when an FD_WRITE network event is recorded and
the associated event object is set.
That is, if you request FD_WRITE events for a pre-existing socket with
WSAEventSelect, you will not get one until the outbound network buffer has
been filled and then has partially emptied. (This is incredibly broken,
but Microsoft evidently has no intention of fixing it.) This behavior
probably explains the bogus-looking code in pgwin32_waitforsinglesocket
and pgwin32_select where we actually issue a zero-length send() to see if
the socket is write-ready. However, it doesn't mean that that code is
right or properly documented. AFAICS the appropriate behavior is to try
*one* send, without any preceding wait, and if that fails with
WSAEWOULDBLOCK then wait normally for FD_WRITE to be signaled.
The claimed connection to UDP-ness of the connection looks like it's
probably a figment of our imagination, as well.
More generally, it seems clear to me that Microsoft's code is designed
around the assumption that an event object remains attached to a socket
for the lifetime of the socket. This business of transiently associating
event objects with sockets looks quite inefficient and is evidently
triggering a lot of unpleasant corner-case behaviors. I wonder whether we
should not try to make "pgsocket" encapsulate a socket and an associated
event object as a single entity on Windows. (Such a struct would be a
good place to keep a per-socket noblock flag, too.) I'm not going to
tackle that myself though.
One other bug I noted is that if pgwin32_waitforsinglesocket gets a
timeout condition, it returns 0 without bothering to set errno. This
seems only latent at the moment since most callers pass INFINITE timeout,
but it probably ought to get cleaned up. Or maybe we should just remove
the timeout option, since once I get pgstat.c fixed there will be no
callers not passing INFINITE timeout.
regards, tom lane