| From: | Michael Banck <mbanck(at)gmx(dot)net> |
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
| Cc: | Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
| Subject: | Re: Maybe BF "timedout" failures are the client script's fault? |
| Date: | 2026-01-09 21:32:55 |
| Message-ID: | 20260109213255.GB21237@p46.dedyn.io;lightning.p46.dedyn.io |
| Lists: | pgsql-hackers |
Hi,
On Fri, Jan 09, 2026 at 03:41:03PM -0500, Tom Lane wrote:
> We've been assuming that all the "timedout" failures on BF member
> fruitcrow were due to some wonkiness in the GNU/Hurd platform.
> I got suspicious about that though after noticing that there are
> a small number of such failures on other animals, eg [1][2][3].
> In each case, the failure message claims it waited a good long
> time, which is at variance with the actually observed runtime.
> For instance [1] says "timed out after 14400 secs", but the
> actual total test runtime is only 01:24:28 according to the
> summary at the top of the page.
>
> Looking into the buildfarm client, I realized that it's assuming that
> "sleep($wait_time)" is sufficient to wait for $wait_time seconds.
> However, the Perl docs point out that sleep() can be interrupted by a
> signal. So now I'm suspicious that many of these failures are caused
> by a stray signal waking up the wait_timeout thread prematurely.
> GNU/Hurd might just be more prone to that than other platforms.
That might be the case for those other failures, but unfortunately, I
think the fruitcrow failures are really because it gets stuck endlessly
in the test_shm_mq test (it is always that one) and only the test
timeout kicks it out.
I've run that test manually quite a lot, and either it finishes in 10-15
seconds, or (presumably) never. This is not really easy to see in the
public buildfarm logs (at least I can't find it at a quick glance), but
I've routinely checked the log timestamps of the runs, and they really
take one hour (wait_timeout) in the case of a hang.
> I propose the attached patch to the BF client to try to make this
> more robust.
Looks sensible, though I wonder whether something should be logged in
case we get woken up early so that we can gather some evidence for this?
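To illustrate the kind of logging I mean, here is a minimal sketch. It is written in Python rather than the client's Perl and is not the attached patch; the function name, the parameters and the log wording are all made up for the example. The loop sleeps again until the deadline has really passed and reports every early wake-up:

```python
import time


def sleep_until_deadline(wait_time, sleep_fn=time.sleep,
                         clock=time.monotonic, log=print):
    """Wait wait_time seconds even if sleep_fn returns early.

    Hypothetical helper for illustration only. Returns the number of
    early wake-ups that were observed and logged.
    """
    deadline = clock() + wait_time
    early_wakeups = 0
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return early_wakeups
        sleep_fn(remaining)
        # If time is still left, the sleep was cut short (e.g. by a signal).
        left = deadline - clock()
        if left > 0:
            early_wakeups += 1
            log(f"woken up early with {left:.1f}s of {wait_time}s left; "
                "sleeping again")
```

Note that Python's own time.sleep() has resumed after a signal since 3.5 unless the handler raises, so the loop only matters for a sleep that can return early, as Perl's sleep() can. The sleep and clock functions are passed in so that an early return can be simulated.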
Michael