lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-01-26 20:36:06
Message-ID: b2bc5c16-899e-ca99-26ed-e623b4259ec7@enterprisedb.com
Lists: pgsql-hackers

Hi,

I received an alert that dikkop (my rpi4 buildfarm animal running freebsd 14)
had not reported any results for a couple of days. It seems it got into
an infinite loop on REL_11_STABLE while building the hash table in a parallel
hash join, or something like that.

It seems to be progressing now, probably because I attached gdb to the
workers to get backtraces, which sends signals to the processes etc.

Anyway, in 'ps ax' I saw this:

94545 - Ss 0:03.39 postgres: buildfarm regression [local] SELECT
94627 - Is 0:00.03 postgres: parallel worker for PID 94545
94628 - Is 0:00.02 postgres: parallel worker for PID 94545

and the backend was stuck waiting on this query:

select final > 1 as multibatch
from hash_join_batches(
$$
select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar
b2 using (id)) ss
on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
$$);

This started on 2023-01-20 23:23:18.125, and the next log entry (after I did
the gdb stuff) is from 2023-01-26 20:05:16.751. Quite a bit of time.

It seems all three processes are in WaitEventSetWait, either through
a ConditionVariable or WaitLatch. But I don't have any good idea of
what might have broken, and as it got "unstuck" I can't investigate
further. But I see there's nodeHash and parallelism involved, and I recall
there are a lot of gotchas due to how the backends cooperate when building
the hash table, etc. Thomas, any idea what might be wrong?
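
For anyone wanting to collect the same kind of data from a stuck leader and its workers, here is a minimal sketch of one way to do it. The function names workers_for and dump_backtraces are made up for illustration, and it assumes 'ps ax' prints process titles in the format shown above and that gdb is installed. Keep in mind that attaching gdb stops and signals the process, which may be what got things moving in this case.

```shell
#!/bin/sh
# Hypothetical helper: read `ps ax` output on stdin and print the PIDs of
# parallel workers whose leader is the PID given as $1.
workers_for() {
    awk -v leader="$1" '
        /postgres: parallel worker for PID/ && $NF == leader { print $1 }
    '
}

# Hypothetical helper: write one backtrace file per process (the leader
# plus its workers), named <pid>.bt.txt.
# Assumption: gdb is available; attaching it stops the target briefly.
dump_backtraces() {
    leader="$1"
    for pid in "$leader" $(ps ax | workers_for "$leader"); do
        gdb -p "$pid" -batch -ex "bt" > "$pid.bt.txt" 2>&1
    done
}

# Usage (not run here): dump_backtraces 94545
```

Splitting the PID lookup from the gdb step makes it possible to check which processes would be touched before attaching to anything.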

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
94628.bt.txt text/plain 10.7 KB
94627.bt.txt text/plain 9.6 KB
94545.bt.txt text/plain 22.2 KB
query.log text/x-log 1.9 KB
