Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-09-08 19:00:00
Message-ID: 7b0b1053-0465-b3a6-0e8f-b5291b7510fa@gmail.com
Lists: pgsql-hackers

Hello,

03.09.2023 00:00, Alexander Lakhin wrote:
> I'll try to test this guess on the target machine...
>

I got access to dikkop thanks to Tomas Vondra and started reproducing the
issue. It was rather difficult to catch the lockup, as Tomas and Tom had
noticed before. I tried using stress-ng to influence reproduction and
gradually managed to reduce the join_hash test to a single query (if I had
been venturous enough, I could have reduced the test to it in one step, but
I chose to move little by little). In the end I constructed a reproducing
script that could catch the lockup on that ARM machine within an hour on
average.
Today I've booted FreeBSD-14.0-ALPHA4 on my secondary AMD (Ryzen 7 3700X)
machine and managed to reproduce the issue with that script (see
attachment). (I run the script just after `gmake && gmake check`.)
To reproduce it faster on my machine, I run it on tmpfs and use parallel:
for ((i=1;i<=4;i++)); do echo "ITERATION $i"; parallel --halt now,fail=1 -j8 --linebuffer --tag time .../repro+.sh {}
::: 01 02 03 04 || break; done

It produces the lockup as follows:
ITERATION 1
...
02      waiting for server to shut down.... done
02      server stopped
02
02      real    3m2.420s
02      user    0m2.896s
02      sys     0m1.685s
04      TIMEOUT on iteration 448
04
04      real    3m16.212s
04      user    0m2.322s
04      sys     0m1.904s
...
psql -p 15435 regression -c "SELECT * FROM pg_stat_activity;"
 16384 | regression | 53696 |            |       10 | user    | psql             |             |                
|          -1 | 2023-09-08 18:44:27.572426+00 | 2023-09-08 18:44:27.573633+00 | 2023-09-08 18:44:27.795518+00 |
2023-09-08 18:44:27.795519+00 | IPC             | HashBuildHashOuter  | active | |          731 |          | explain
(analyze)  select count(*) from simple r join simple s using (id); | client backend
 16384 | regression | 53894 |      53696 |       10 | user    | psql             |             |                
|             | 2023-09-08 18:44:27.796749+00 | 2023-09-08 18:44:27.573633+00 | 2023-09-08 18:44:27.795518+00 |
2023-09-08 18:44:27.799261+00 | IPC             | HashBuildHashOuter  | active | |          731 |          | explain
(analyze)  select count(*) from simple r join simple s using (id); | parallel worker
 16384 | regression | 53896 |      53696 |       10 | user    | psql             |             |                
|             | 2023-09-08 18:44:27.797436+00 | 2023-09-08 18:44:27.573633+00 | 2023-09-08 18:44:27.795518+00 |
2023-09-08 18:44:27.799291+00 | IPC             | HashBuildHashInner  | active | |          731 |          | explain
(analyze)  select count(*) from simple r join simple s using (id); | parallel worker
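
(The same information is easier to read with a narrower query over the
standard pg_stat_activity columns, e.g.:
psql -p 15435 regression -c "SELECT pid, wait_event_type, wait_event, state, backend_type
  FROM pg_stat_activity WHERE wait_event_type = 'IPC';"
)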

procstat -i 53896
53896 postgres         URG      P-C

tail server04.log
2023-09-08 18:44:27.777 UTC|user|regression|53696|64fb6b8b.d1c0|LOG:  statement: explain (analyze)  select count(*) from
simple r join simple s using (id);
2023-09-08 18:44:27.786 UTC|user|regression|53696|64fb6b8b.d1c0|LOG:  statement: explain (analyze)  select count(*) from
simple r join simple s using (id);
2023-09-08 18:44:27.795 UTC|user|regression|53696|64fb6b8b.d1c0|LOG:  statement: explain (analyze)  select count(*) from
simple r join simple s using (id);
2023-09-08 18:45:38.685 UTC|[unknown]|[unknown]|66915|64fb6bd2.10563|LOG:  connection received: host=[local]
2023-09-08 18:45:38.685 UTC|user|regression|66915|64fb6bd2.10563|LOG:  connection authorized: user=user
database=regression application_name=psql
2023-09-08 18:45:38.686 UTC|user|regression|66915|64fb6bd2.10563|LOG:  statement: SELECT * FROM pg_stat_activity;

It takes less than 10 minutes on average for me. I checked REL_12_STABLE,
REL_13_STABLE, and REL_14_STABLE (with HAVE_KQUEUE forcibly undefined); they
are all affected.
I could not reproduce the lockup on my Ubuntu box (with HAVE_SYS_EPOLL_H
undefined manually). And, surprisingly to me, I could not reproduce it on
master or REL_16_STABLE.
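
(For anyone wanting to repeat the runs with kqueue/epoll disabled: one way,
though there are others, to undefine those macros is to edit the generated
pg_config.h after configure and rebuild, e.g. on FreeBSD:
sed -i.bak 's|#define HAVE_KQUEUE 1|/* #undef HAVE_KQUEUE */|' src/include/pg_config.h
gmake clean && gmake
)
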
`git bisect` for this behavior change pointed at 7389aad63 (though maybe it
just greatly decreased the probability of the failure; I'm going to
double-check this).
In particular, that commit changed this:
-    /*
-     * Ignore SIGURG for now.  Child processes may change this (see
-     * InitializeLatchSupport), but they will not receive any such signals
-     * until they wait on a latch.
-     */
-    pqsignal_pm(SIGURG, SIG_IGN);   /* ignored */
-#endif
+    /* This may configure SIGURG, depending on platform. */
+    InitializeLatchSupport();
+    InitProcessLocalLatch();

With debug logging added I can see (on 7389aad63~1) that one process really
does send SIGURG to another, and the latter reaches poll(), but it gets no
signal: its signal handler is not called, and poll() just keeps waiting...
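
To make the mechanism concrete, here is a minimal self-contained sketch (a
simplification for illustration, not the actual latch.c code) of the
self-pipe wakeup that the poll()-based latch code relies on: the SIGURG
handler makes a pipe readable, and the waiter poll()s on the pipe's read
end, so if the handler never runs, nothing ever wakes poll():

#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int selfpipe[2];            /* [0] = read end, [1] = write end */

static void
sigurg_handler(int signo)
{
    int save_errno = errno;

    /* Make the read end readable so a pending poll() returns. */
    (void) write(selfpipe[1], "x", 1);
    errno = save_errno;
}

int
main(void)
{
    struct sigaction sa;
    struct pollfd pfd;

    if (pipe(selfpipe) != 0)
    {
        perror("pipe");
        return 1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigurg_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGURG, &sa, NULL);

    printf("pid %d waiting; try: kill -URG %d\n", (int) getpid(), (int) getpid());
    fflush(stdout);

    pfd.fd = selfpipe[0];
    pfd.events = POLLIN;

    for (;;)
    {
        int rc = poll(&pfd, 1, -1);    /* should wake when the handler writes */

        if (rc > 0)
        {
            char buf[16];

            (void) read(selfpipe[0], buf, sizeof(buf));
            printf("woken up by SIGURG\n");
            fflush(stdout);
        }
        else if (rc < 0 && errno != EINTR)
        {
            perror("poll");
            return 1;
        }
    }
}

On a healthy system, `kill -URG <pid>` makes this loop print "woken up"
immediately; a process stuck the way the parallel worker above is stuck
would just sit in poll().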

So it looks like the ARM weak memory model is not the root cause of the
issue. But as far as I can see, it is still specific to FreeBSD (though not
to a particular compiler; I used both gcc and clang with the same result).

Best regards,
Alexander

Attachment Content-Type Size
repro+.sh application/x-shellscript 1.6 KB
