Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-09-01 13:00:29
Message-ID: 2ca4d9b5-ebcd-cc6b-8535-3edbd9dcf630@enterprisedb.com
Lists: pgsql-hackers

On 9/1/23 10:00, Alexander Lakhin wrote:
> Hello Thomas,
>
> 31.08.2023 14:15, Thomas Munro wrote:
>
>> We have a signal that is pending and not blocked, so I don't
>> immediately know why poll() hasn't returned control.
>
> When I worked at the Postgres Pro company, we observed a similar lockup
> under rather specific conditions (we used an Elbrus CPU and the specific
> Elbrus compiler (lcc), based on EDG).
> I managed to reproduce that lockup and Anton Voloshin investigated it.
> The issue was caused by the compiler optimization in WaitEventSetWait():
>     waiting = true;
> ...
>     while (returned_events == 0)
>     {
> ...
>         if (set->latch && set->latch->is_set)
>         {
> ...
>             break;
>         }
>
> In that case, the compiler decided that it may place the read of
> "set->latch->is_set" before the write "waiting = true".
> (Placing "pg_compiler_barrier();" just after "waiting = true;" fixed the
> issue for us.)
> I can't provide more details for now, but maybe you could look at the
> binary code generated on the target platform to confirm or reject my guess.
>
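
For anyone following along, the kind of fix described above can be sketched
roughly like this. This is only an illustration, assuming GCC or Clang: the
names demo_compiler_barrier, DemoLatch, demo_waiting and demo_begin_wait are
made up for the example and are not the actual latch.c code. The point is
just that the barrier sits between the write and the read, so the compiler
may not hoist the read above the write.

```c
#include <stdbool.h>
#include <signal.h>

/*
 * Assumption: GCC or Clang. An empty asm statement with a "memory" clobber
 * prevents the compiler from moving loads or stores across this point.
 * It emits no CPU fence instruction.
 */
#define demo_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for a latch; not PostgreSQL's struct. */
typedef struct DemoLatch
{
    sig_atomic_t is_set;
} DemoLatch;

/* Hypothetical flag that a signal handler would inspect. */
static sig_atomic_t demo_waiting = 0;

/*
 * Announce that we are about to sleep, then check the latch.
 * Returns true if the latch was already set, i.e. we must not sleep.
 */
static bool
demo_begin_wait(DemoLatch *latch)
{
    demo_waiting = 1;           /* the write */
    demo_compiler_barrier();    /* keep the read below after the write above */
    return latch->is_set != 0;  /* the read */
}
```

Without the barrier, and with neither variable declared volatile, the
compiler is free to order the write and the read either way, which matches
the reordering Alexander describes.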

Hmmm, I'm not very good at reading disassembly, but here's what
objdump produced for WaitEventSetWait (attached). Maybe someone will
see what the issue is.

I thought about just adding the barrier to the code, but then how would
we know that this was the issue and that the barrier fixed it? The lockup
happens so rarely that we can't draw any conclusions from a couple of
test runs.
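
To put a rough number on that: if the lockup hits each run independently
with probability p, the chance of n clean runs in a row with the bug still
present is (1 - p)^n. A small sketch of how many clean runs it takes to push
that below some threshold alpha follows. The function name
clean_runs_needed and the values of p and alpha are purely hypothetical. I
don't know the actual failure rate on dikkop.

```c
/*
 * Smallest n such that (1 - p)^n <= alpha, i.e. the number of consecutive
 * clean runs after which "the bug is still there and we just got lucky"
 * has probability at most alpha. Assumes independent runs.
 * Returns -1 for invalid input or if the cap is reached.
 */
static long
clean_runs_needed(double p, double alpha)
{
    const long  max_runs = 100000000L;  /* arbitrary cap for the sketch */
    double      still_clean = 1.0;      /* (1 - p)^n so far */
    long        n = 0;

    if (p <= 0.0 || p >= 1.0 || alpha <= 0.0 || alpha >= 1.0)
        return -1;

    while (still_clean > alpha)
    {
        if (n >= max_runs)
            return -1;
        still_clean *= (1.0 - p);
        n++;
    }
    return n;
}
```

For example, with a made-up p = 0.01 and alpha = 0.05, this gives 299
runs, and a rarer failure needs proportionally more.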

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: disas.log (text/x-log, 15.7 KB)
