Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date: 2023-01-26 21:06:45
Message-ID: CA+hUKGLtVM4-qxtXMHYp9hjwPdhJSnvBrVfZtAPyizsuGydkAA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 27, 2023 at 9:57 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Fri, Jan 27, 2023 at 9:49 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
> > > I received an alert dikkop (my rpi4 buildfarm animal running freebsd 14)
> > > did not report any results for a couple days, and it seems it got into
> > > an infinite loop in REL_11_STABLE when building hash table in a parallel
> > > hashjoin, or something like that.
> >
> > > It seems to be progressing now, probably because I attached gdb to the
> > > workers to get backtraces, which does signals etc.
> >
> > That reminds me of cases that I saw several times on my now-deceased
> > animal florican:
> >
> > https://www.postgresql.org/message-id/flat/2245838.1645902425%40sss.pgh.pa.us
> >
> > There's clearly something rotten somewhere in there, but whether
> > it's our bug or FreeBSD's isn't clear.
>
> And if it's ours, it's possibly in latch code and not anything higher
> (I mean, not in condition variables, barriers, or parallel hash join)
> because I saw a similar hang in the shm_mq stuff which uses the latch
> API directly. Note that 13 switched to kqueue but still used the
> self-pipe, and 14 switched to a signal event, and this hasn't been
> reported in those releases or later, which makes the poll() code path
> a key suspect.

Also, 14 changed the flag/memory barrier dance (maybe_sleeping), but
13 did it the same way as 11 + 12. So between 12 and 13 we have just
the poll -> kqueue change.
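
For readers unfamiliar with the pre-14 design, here is a minimal sketch of the
self-pipe pattern that the poll() code path depends on. It is not the
PostgreSQL code: it is Python rather than C, a thread stands in for the other
process, a direct write to the pipe stands in for the signal handler that
writes the byte in the real implementation, and the class and method names
(SelfPipeLatch, set, reset, wait) are made up for illustration. Memory
barriers are not modelled at all.

```python
import os
import select
import threading


class SelfPipeLatch:
    """Toy latch: a flag plus a non-blocking pipe that poll() can sleep on."""

    def __init__(self):
        self.is_set = False
        self.read_fd, self.write_fd = os.pipe()
        os.set_blocking(self.read_fd, False)
        os.set_blocking(self.write_fd, False)

    def set(self):
        # Publish the flag first, then wake the waiter via the pipe.
        if self.is_set:
            return
        self.is_set = True
        try:
            os.write(self.write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wakeup is already pending

    def reset(self):
        self.is_set = False

    def _drain(self):
        # Discard pending wakeup bytes so the next poll() can block.
        try:
            while os.read(self.read_fd, 512):
                pass
        except BlockingIOError:
            pass

    def wait(self, timeout_ms):
        # Drain, then check the flag, then sleep. A set() that lands after
        # the drain leaves a byte in the pipe, so poll() returns at once.
        poller = select.poll()
        poller.register(self.read_fd, select.POLLIN)
        while True:
            self._drain()
            if self.is_set:
                return True
            if not poller.poll(timeout_ms):
                return False  # timed out with the flag still clear


if __name__ == "__main__":
    latch = SelfPipeLatch()
    threading.Timer(0.05, latch.set).start()
    print(latch.wait(5000))
```

The ordering is the point of the sketch: the setter writes the flag before the
wakeup byte, and the waiter drains the pipe before checking the flag. A wakeup
lost anywhere in that sequence leaves the waiter asleep in poll() until some
unrelated event interrupts it, which would be consistent with the hang
clearing once gdb was attached.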
