Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-01-27 09:23:58
Message-ID: CA+hUKG+YkAnOLrKKcy-FLjoVUV3r=L+c28gzMSL58Cv9jC4nvg@mail.gmail.com
Lists: pgsql-hackers

After 1000 make check loops, and 1000 make -C src/test/modules/test_shm_mq
check loops, on the same FBSD 13.1 machine as elver which has failed
like this once before, I haven't been able to reproduce this on
REL_12_STABLE. Not really sure how to chase this, but if you see this
situation again, I'd be interested to see the output of fstat -p PID
(shows bytes in pipes) and procstat -j PID (shows pending signals) for
all PIDs involved (before connecting a debugger or doing anything else
that might make it return with EINTR, after which we know it continues
happily because it then sees latch->is_set next time around the loop).
If poll() is not returning when there are bytes ready to read from the
self-pipe, which fstat can show, I think that'd indicate a kernel bug.
The same goes if procstat -j shows signals pending while it's somehow
still blocked in the syscall. Otherwise, it might indicate a compiler or
postgres bug, but I don't have any particular theories.
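
For readers unfamiliar with the pattern under discussion, here is a minimal
sketch of a latch built on a self-pipe and poll(). It is not PostgreSQL's
latch code: the Latch class, its method names, and wake_from_other_thread()
are hypothetical, and set() writes to the pipe directly rather than from a
signal handler. It only illustrates why bytes sitting in the pipe should
always make poll() return.

```python
import os
import select
import threading


class Latch:
    """Toy latch (hypothetical): a flag plus a non-blocking self-pipe."""

    def __init__(self):
        self.is_set = False
        self.read_fd, self.write_fd = os.pipe()
        os.set_blocking(self.read_fd, False)
        os.set_blocking(self.write_fd, False)

    def set(self):
        # Set the flag first, then make the pipe readable so that a
        # waiter blocked in poll() wakes up.
        self.is_set = True
        try:
            os.write(self.write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wakeup is already pending

    def reset(self):
        self.is_set = False

    def _drain(self):
        # Discard wakeup bytes; an empty non-blocking pipe raises
        # BlockingIOError, which ends the loop.
        try:
            while os.read(self.read_fd, 1024):
                pass
        except BlockingIOError:
            pass

    def wait(self, timeout_ms):
        """Return True once the latch is set, False on timeout."""
        poller = select.poll()
        poller.register(self.read_fd, select.POLLIN)
        while not self.is_set:
            # If bytes are in the pipe, the kernel must return here.
            events = poller.poll(timeout_ms)
            if not events:
                return False
            self._drain()
            # Loop around and re-check the flag, as the real wait loop
            # re-checks latch->is_set.
        return True


def wake_from_other_thread(latch, delay_s):
    """Hypothetical helper: set the latch from another thread later."""
    timer = threading.Timer(delay_s, latch.set)
    timer.start()
    return timer
```

In this sketch a waiter that stays blocked while the pipe holds unread bytes
can only be explained by poll() misbehaving, which is the situation fstat -p
would expose on the real system.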
