Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-09-08 19:39:45
Message-ID: CA+hUKG+O-PZOM1f9nSMGQ-3f3b_3F-jJ28Xt+WM9271zkZz4yg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 9, 2023 at 7:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> It takes less than 10 minutes on average for me. I checked
> REL_12_STABLE, REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE undefined
> forcefully) — they all are affected.
> I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H
> undefined manually). And surprisingly for me, I could not reproduce it on
> master and REL_16_STABLE.
> `git bisect` for this behavior change pointed at 7389aad63 (though maybe it
> just greatly decreased probability of the failure; I'm going to double-check
> this).
> In particular, that commit changed this:
> - /*
> - * Ignore SIGURG for now. Child processes may change this (see
> - * InitializeLatchSupport), but they will not receive any such signals
> - * until they wait on a latch.
> - */
> - pqsignal_pm(SIGURG, SIG_IGN); /* ignored */
> -#endif
> + /* This may configure SIGURG, depending on platform. */
> + InitializeLatchSupport();
> + InitProcessLocalLatch();
>
> With debugging logging added I see (on 7389aad63~1) that one process
> really sends SIGURG to another, and the latter reaches poll(), but it
> just got no signal; its signal handler is not called and poll() just waits...

Thanks for working so hard on this, Alexander. That is a surprising
discovery! So changes to the signal handler arrangements in the
*postmaster* before the child was forked affected this?

> So it looks like the ARM weak memory model is not the root cause of the
> issue. But as far as I can see, it's still specific to FreeBSD (but not
> specific to a compiler — I used gcc and clang with the same success).

Idea: FreeBSD 13 introduced a new mechanism called sigfastblock[1],
which lets system libraries control signal blocking with atomic memory
tricks in a word of user space memory. I have no particular theory
for why it would be going wrong here (I don't expect us to be using
any of the stuff that would use it, though I don't understand it in
detail so that doesn't say much), but it occurred to me that all
reports so far have been on 13.x or 14. I wonder... If you have a
good fast recipe for reproducing this, could you also try it on
FreeBSD 12.4?

[1] https://man.freebsd.org/cgi/man.cgi?query=sigfastblock&sektion=2&manpath=FreeBSD+13.0-current
