Re: buildfarm instance bichir stuck

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Robins Tharakan <tharakan(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: buildfarm instance bichir stuck
Date: 2021-04-07 06:16:28
Message-ID: CA+hUKG+Sm8ZDiyW5Sr-5QZAK377dy=WHoFSU4vu=tgHOqS5JQQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 7, 2021 at 5:44 PM Robins Tharakan <tharakan(at)gmail(dot)com> wrote:
> Bichir's been stuck for the past month and is unable to run regression tests since 6a2a70a02018d6362f9841cc2f499cc45405e86b.

Hrmph. That's "Use signalfd(2) for epoll latches." I had a similar
report from an illumos user (but it was intermittent). I have never
seen such a failure on Linux. My first guess is that these two
systems that are doing Linux system call emulation have implemented
subtly different semantics, and something is going wrong like this: a
SIGUSR1 arrives to tell you some important news about a procsignal and
the signal handler calls SetLatch(MyLatch) which does kill(MyProcPid,
SIGURG), but somehow that fails to wake up the epoll_wait() you are
sleeping in which contains the signalfd that should receive the signal
and report it by being readable, due to some internal race. Or
something like that. But I haven't been able to verify that theory
because I don't have any of those computers. If it is indeed
something like that and not a bug in my code, then I was thinking that
the main tool available to deal with it would be to set WAIT_USE_POLL
in the relevant template file, so that we don't use the combination of
epoll + signalfd on illumos, but then WSL1 throws a spanner in the
works because AFAIK it's masquerading as Ubuntu, running PostgreSQL
from an Ubuntu package with a freaky kernel. Hmm.
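
One way to check the theory above on an affected system would be a small
standalone probe along the following lines. This is only a sketch of the
signal path described in the message, not PostgreSQL's latch code: the
function name signalfd_epoll_wakes() and its parameters are made up for
illustration, and SIGALRM from a one-shot timer stands in for the incoming
SIGUSR1 (an assumption, chosen so that the signal arrives while the process
is sleeping in epoll_wait()). The handler forwards the wakeup with
kill(getpid(), SIGURG), and SIGURG is blocked so that it is only visible
through the signalfd registered with epoll.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/time.h>
#include <unistd.h>

/* Stand-in for the procsignal handler: forward the wakeup to ourselves as SIGURG. */
static void
forward_as_sigurg(int signo)
{
    int         save_errno = errno;

    (void) signo;
    kill(getpid(), SIGURG);     /* kill() is async-signal-safe */
    errno = save_errno;
}

/*
 * Hypothetical probe.  delay_ms must be > 0 (a zero timer value would disarm
 * the timer).  Returns 1 if epoll_wait() reported the signalfd readable and
 * SIGURG could be read from it, 0 if epoll_wait() timed out, -1 on any error.
 */
int
signalfd_epoll_wakes(int delay_ms, int timeout_ms)
{
    sigset_t    mask;
    struct sigaction sa;
    struct itimerval timer;
    struct epoll_event ev, out;
    struct signalfd_siginfo info;
    int         sfd, epfd, rc, result = -1;

    /* Block SIGURG so it stays pending and is consumed only via the signalfd. */
    sigemptyset(&mask);
    sigaddset(&mask, SIGURG);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;

    sfd = signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
    if (sfd < 0)
        return -1;
    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
    {
        close(sfd);
        return -1;
    }

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = sfd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev) < 0)
        goto done;

    /* The timer signal plays the role of the SIGUSR1 arriving mid-sleep. */
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = forward_as_sigurg;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        goto done;

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_sec = delay_ms / 1000;
    timer.it_value.tv_usec = (delay_ms % 1000) * 1000;
    if (setitimer(ITIMER_REAL, &timer, NULL) < 0)
        goto done;

    /* The handler interrupts epoll_wait() with EINTR; retry as a wait loop would. */
    do
    {
        rc = epoll_wait(epfd, &out, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc == 0)
        result = 0;             /* timed out: the wakeup was lost */
    else if (rc == 1 &&
             read(sfd, &info, sizeof(info)) == (ssize_t) sizeof(info) &&
             info.ssi_signo == (uint32_t) SIGURG)
        result = 1;             /* woken by SIGURG through the signalfd */

done:
    close(epfd);
    close(sfd);
    return result;
}
```

Calling signalfd_epoll_wakes(50, 2000) from a trivial main() should return
1 on a kernel whose epoll and signalfd cooperate as expected. A return of 0
would be consistent with the lost-wakeup theory, though a single run cannot
rule out an intermittent race.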

> It is interesting that that commit's a month old and probably no other client has complained since, but diving in, I can see that it's been unable to even start regression tests after that commit went in.

Oh, well at least it's easily reproducible then, that's something!

> Note that Bichir is running on WSL1 (not WSL2) - i.e. Windows Subsystem for Linux inside Windows 10 - and so isn't really production use-case. The only run that actually got submitted to Buildfarm was from a few days back when I killed it after a long wait - see [1].
>
> Since yesterday, I have another run that's again stuck on CREATE DATABASE (see outputs below) and although pstack not working may be a limitation of the architecture / installation (unsure), a trace shows it is stuck at poll.

That's actually the client. I guess there is also a backend process
stuck somewhere in epoll_wait()?
