Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Date: 2022-01-14 21:59:00
Message-ID: CA+hUKGK4VWGN=pLgbkU_aud4O5ZAACx1dRDk5uw8Jt_jVAk66Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 15, 2022 at 9:28 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think it doesn't even need to touch socket.c to cause breakage. Using two
> different WaitEventSets is enough.

Right. I was interested in your observation because so far we'd
*only* been considering the two-consecutive-WaitEventSets case, which
we could grok experimentally. The patch Alexander tested most
recently uses a tri-state eof flag, so (1) the WES event starts out in
"unknown" state and polls with WSAPoll() to figure out whether the
socket was already closed when it wasn't looking, and then (2) it
switches to believing that we'll definitely get an FD_CLOSE so we
don't need to make the extra call on every wait. That does seem to do
the job, but if there are *other* places that can eat FD_CLOSE event
once we've switched to believing that it will come, that might be
fatal to the second part of that idea, and we might have to assume
"unknown" all the time, which would be somewhat similar to the way we
do a dummy WSASend() every time when waiting for WRITEABLE.

(That patch is assuming that we're looking for something simple to
back-patch, in the event that we decide not to just revert the
graceful shutdown patch from back branches. Reverting might be a
better idea for now, and then we could fix all this stuff going
forward.)

> > Issuing a WSAEventSelect for a socket cancels any previous WSAAsyncSelect or
> > WSAEventSelect for the same socket and clears the internal network event
> > record.
>
> Note the bit about clearing the internal network event record. Which seems to
> pretty precisely explain why we're loosing FD_CLOSEs?

Indeed.

> And it does also explain why this is more likely after the shutdown changes:
> It's more likely the network stack knows it has readable data *and* that the
> connection closed. Which is recorded in the "internal network event
> record". But once all the data is read, walsender.c will do another
> WaitLatchOrSocket(), which does WSAEventSelect(), clearing the "internal event
> record" and loosing the FD_CLOSE.

Yeah.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2022-01-14 22:01:44 Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Previous Message Robert Haas 2022-01-14 21:53:12 Re: Refactoring of compression options in pg_basebackup