Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: lockup in parallel hash join on dikkop (freebsd 14.0-current)
Date:	2023-01-30 02:22:34
Message-ID:	CA+hUKGKLMJuxq0O600h7uPJ2WKZ6Pip+GG9ahFYnKYEPyS1jHw@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Jan 30, 2023 at 6:26 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> out-of-order hazard

I've been trying to understand how that could happen, but my CPU-fu is
weak. Let me try to write an argument for why it can't happen, so
that later I can look back at how stupid and naive I was. We have A
B, and if the CPU sees no dependency and decides to execute B A
(pipelined), shouldn't an interrupt either wait for the whole
schemozzle to commit first (if not in a hurry), or nuke it, handle the
IPI and restart, or something? After an hour of reviewing random
slides from classes on out-of-order execution and reorder buffers and
the like, I think the term for making sure that interrupts run with
the illusion of in-order execution maintained is "precise
interrupts", and it is expected in all modern architectures, after the
early OoO pioneers lost their minds trying to program without it. I
guess generally you want that because it would otherwise run your
interrupt handler in a completely uncertain environment, and
specifically in this case it would reach our signal handler which
reads A's output (waiting) and writes to B's input (is_set), so B IPI
A surely shouldn't be allowed?

As for compiler barriers, I see that elver's compiler isn't reordering the code.

Maybe it's a much dumber sort of a concurrency problem: stale cache
line due to missing barrier, but... commit db0f6cad488 made us also
set our own latch (a second time) when someone sets our latch in
releases 9.something to 13. Which should mean that we're guaranteed
to see is_set = true in the scenario described, because we'll clobber
it ourselves if we have to, for good measure.

If our secondary SetLatch() sees it's already set and decides not to
set it, then it's possible that the code we interrupted was about to
run ResetLatch(), but any code doing that must next check its expected
exit condition (or it has a common-or-garden latch protocol bug, as
has been discovered from time to time in the tree...).
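
To make that last point concrete, here is a minimal single-process sketch of the race, not PostgreSQL code. ToyLatch, toy_set_latch(), toy_reset_latch(), work_ready and waiter_notices_work() are all made-up names, SIGINT is an arbitrary stand-in for the real wakeup signal, and raise() is used only to force the handler to run at the awkward moment just before the reset. It assumes the handler both publishes the condition and sets the latch, which is a simplification of what a separate sender process would do.

```c
#include <signal.h>
#include <stdbool.h>

/* Toy stand-in for a latch; hypothetical, not PostgreSQL's Latch struct. */
typedef struct ToyLatch
{
    volatile sig_atomic_t is_set;
} ToyLatch;

static ToyLatch my_latch;
static volatile sig_atomic_t work_ready;    /* the waiter's exit condition */

/* Like the "secondary SetLatch()" above: do nothing if already set. */
static void toy_set_latch(ToyLatch *latch)
{
    if (latch->is_set)
        return;
    latch->is_set = 1;
}

static void toy_reset_latch(ToyLatch *latch)
{
    latch->is_set = 0;
}

/* Models the wakeup: the condition becomes true and our own latch is set. */
static void wakeup_handler(int signo)
{
    (void) signo;
    work_ready = 1;
    toy_set_latch(&my_latch);
}

/*
 * Simulate one pass through a wait loop in which the wakeup lands just
 * before the reset.  Returns true if the waiter would notice the work,
 * false if it would go on to block with nothing left to wake it.
 */
bool waiter_notices_work(bool recheck_after_reset)
{
    bool saw_work;

    work_ready = 0;
    my_latch.is_set = 0;
    signal(SIGINT, wakeup_handler);

    saw_work = (work_ready != 0);   /* checked before the wakeup: false */
    raise(SIGINT);                  /* handler runs here, before the reset */
    toy_reset_latch(&my_latch);     /* clobbers the is_set the handler wrote */

    if (recheck_after_reset)
        saw_work = (work_ready != 0);   /* protocol-following code re-checks */

    /* A real waiter would sleep now unless one of these is true. */
    return saw_work || (my_latch.is_set != 0);
}
```

In this toy, waiter_notices_work(true) reports that the work is seen even though the reset wiped is_set, while waiter_notices_work(false) models the common-or-garden bug: the condition was checked only before the reset, so the waiter would sleep on a latch that nobody is going to set again.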

/me wanders away with a renewed fear of computers and the vast
complexities they hide

In response to

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) at 2023-01-29 17:26:02 from Thomas Munro

Responses

Re: lockup in parallel hash join on dikkop (freebsd 14.0-current) at 2023-01-30 05:36:50 from Andres Freund
