Re: conchuela timeouts since 2021-10-09 system upgrade

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: conchuela timeouts since 2021-10-09 system upgrade
Date: 2021-10-26 14:29:39
Message-ID: 83446.1635258579@sss.pgh.pa.us
Lists: pgsql-bugs

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Tue, Oct 26, 2021 at 02:03:54AM -0400, Tom Lane wrote:
>> Or more
>> practically, use advisory locks in that script to enforce that only one
>> runs at once.

> The author did try that.

Hmm ... that ought to have done the trick, I'd think. However:

> Both sound doable, but I don't expect either to fix prairiedog's trouble.

Yeah :-(. I think this test is somehow stumbling over a pre-existing bug.

>> So what we have is that libpq thinks it's sent the next DROP INDEX,
>> but the backend hasn't seen it.

> Thanks for isolating that.

The plot thickens. When I went back to look at that machine this morning,
I found this in the postmaster log:

2021-10-26 02:52:09.324 EDT [1013] 002_cic.pl LOG: statement: DROP INDEX CONCURRENTLY idx;
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl LOG: could not send data to client: Broken pipe
2021-10-26 02:52:09.352 EDT [1013] 002_cic.pl FATAL: connection to client lost

The timestamps correspond (more or less anyway) to when I killed off the
stuck test run and went to bed. So the DROP command *was* sent, and it
was eventually received by the backend, but it seems to have taken killing
the pgbench process to do it.

I think this probably exonerates the pgbench/libpq side of things, and
instead we have to wonder about a backend or kernel bug. A kernel bug
could possibly explain the unexplainable connection to what's happening on
some other file descriptor. I'd be prepared to believe that prairiedog's
ancient macOS version has some weird bug preventing kevent() from noticing
available data ... but (a) surely conchuela wouldn't share such a bug,
and (b) we've been using kevent() for a couple years now, so how come
we didn't see this before?

Still baffled. I'm currently experimenting to see if the bug reproduces
when latch.c is made to use poll() instead of kevent(). But the failure
rate was low enough that it'll be hours before I can say confidently
that it doesn't (unless, of course, it does).
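
For readers unfamiliar with the poll() side of that comparison, here is a minimal sketch of waiting for readable data on a single descriptor with poll(). It is not the code in latch.c; the helper name wait_readable and its return convention are assumptions made up for illustration.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not from latch.c): wait until fd has data to read.
 * Returns 1 if the descriptor is readable (or has hung up / errored, in
 * which case a subsequent read() will report the condition), 0 if the
 * timeout expired, and -1 if poll() itself failed.
 * timeout_ms follows poll() semantics: -1 blocks indefinitely, 0 returns
 * immediately.
 */
int
wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int           rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /*
     * Retry if a signal interrupts the wait.  For simplicity this restarts
     * with the full timeout rather than the remaining time.
     */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;              /* poll() failed; errno is set */
    if (rc == 0)
        return 0;               /* timed out, nothing to read */

    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}
```

If a wait like this sees the pending data where the kevent()-based wait does not, that would point at the kernel's kevent() rather than at the backend.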

regards, tom lane
