Re: Maybe BF "timedout" failures are the client script's fault?

From:	Michael Banck <mbanck(at)gmx(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Maybe BF "timedout" failures are the client script's fault?
Date:	2026-01-09 21:32:55
Message-ID:	20260109213255.GB21237@p46.dedyn.io;lightning.p46.dedyn.io
Lists:	pgsql-hackers

Hi,

On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
> We've been assuming that all the "timedout" failures on BF member
> fruitcrow were due to some wonkiness in the GNU/Hurd platform.
> I got suspicious about that though after noticing that there are
> a small number of such failures on other animals, eg [1][2][3].
> In each case, the failure message claims it waited a good long
> time, which is at variance with the actually observed runtime.
> For instance [1] says "timed out after 14400 secs", but the
> actual total test runtime is only 01:24:28 according to the
> summary at the top of the page.
>
> Looking into the buildfarm client, I realized that it's assuming that
> "sleep($wait_time)" is sufficient to wait for $wait_time seconds.
> However, the Perl docs point out that sleep() can be interrupted by a
> signal. So now I'm suspicious that many of these failures are caused
> by a stray signal waking up the wait_timeout thread prematurely.
> GNU/Hurd might just be more prone to that than other platforms.

That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.

I've run that test manually quite a lot, and either it finishes in 10-15
seconds or (presumably) never. This is not easy to see in the public
buildfarm logs (at least I can't find it at a quick glance), but I've
routinely checked the log timestamps of the runs, and they really do
take one hour (wait_timeout) in the case of a hang.

> I propose the attached patch to the BF client to try to make this
> more robust.

Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?
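
To make the logging idea concrete, here is a rough sketch of a wait loop
that goes back to sleep after an early wakeup and notes each one. It is
not taken from the attached patch or from the real client code:
sleep_at_least is a made-up name, and writing the message with warn is
only an assumption about where such evidence might go.

```perl
use strict;
use warnings;

# Hypothetical helper, not the buildfarm client's actual code.
# Wait until at least $wait_time seconds of wall-clock time have passed,
# going back to sleep (and logging) whenever sleep() returns early.
# Returns the number of early wakeups seen.
sub sleep_at_least
{
    my ($wait_time) = @_;
    my $start    = time();
    my $deadline = $start + $wait_time;
    my $early    = 0;

    while ((my $left = $deadline - time()) > 0)
    {
        sleep($left);    # may return early if a signal arrives

        my $elapsed = time() - $start;
        if ($elapsed < $wait_time)
        {
            # Evidence that something interrupted the sleep.
            $early++;
            warn sprintf("%s: sleep woke after %d of %d secs, resuming\n",
                scalar(localtime), $elapsed, $wait_time);
        }
    }
    return $early;
}
```

Since time() only has one-second resolution, this can return up to a
second short of $wait_time, which should not matter for timeouts
measured in hours. The count it returns could be included in the
"timed out" message if that turns out to be useful.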

Michael
