Re: [EXTERNAL]Re: BUG #19094: select statement on postgres 17 vs postgres 18 is returning different/duplicate results

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Richard Guo <guofenglinux(at)gmail(dot)com>, Lori Corbani <Lori(dot)Corbani(at)jax(dot)org>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: [EXTERNAL]Re: BUG #19094: select statement on postgres 17 vs postgres 18 is returning different/duplicate results
Date: 2025-10-30 03:43:44
Message-ID: CA+hUKGK2ken0RAHV1crSaS6zNWDkSWxKdb=S9BSxAUo0CfZx7g@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Oct 28, 2025 at 9:12 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> After looking a little closer, we've got:
>
> if (node->js.jointype == JOIN_RIGHT_SEMI &&
> HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
> continue;
> ...
> if (joinqual == NULL || ExecQual(joinqual, econtext))
> {
> if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
> HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> As far as I can see, in a parallel hash join node->hj_CurTuple will
> be pointing into a shared hash table, so that this code is fooling
> with a shared HasMatch flag bit with no sort of lock whatsoever.
> This query doesn't have any joinqual at the PHJ level if I'm reading
> EXPLAIN correctly, so the window to get it wrong isn't very wide.
> Nonetheless, if two parallel workers inspect the same inner tuple
> at about the same time, this code would allow both of them to
> return it.

Yeah, that way of writing to it was good enough for the original
purpose: no backend would read it again until after synchronising on a
barrier, it doesn't matter if two backends set it, and though the
relaxed check if it's already set can be wrong, only in an acceptable
direction. We only cared about which tuples were never matched, after
the probe phase is finished.

> I'm unsure if we've got any infrastructure that'd allow setting the
> tuple's match bit in a more atomic fashion.

Interesting problem. My first drive-by thought is that we would need:
some new smaller atomics (that's a 16 bit member), and a decision that
it is OK to use the atomics API by casting (should be OK, since we no
longer have emulated atomics in smaller sizes, so now it's just a
fancy way of accessing memory with "boxed" types that have the same
size and alignment as the underlying type, and plain stores should
work for initialisation if careful about synchronisation). I guess
the op we'd want is atomic_fetch_or with a check of the old value to
see if you lost the race. I suppose the smallest code change would be
to keep the existing code flow that was apparently almost working here
(the window of optimism being just between the relaxed check at the
top and the set at the bottom). If you fail to be the first to set
the bit, you abandon the work done, but it should hopefully be rare
enough and cheaper overall than something "pessimistic" that does
something akin to locking the tuple while thinking about it, and the
time it took for this to be discovered seems to support the
"optimistic" idea (I guess it depends how much processing happens to
determine that it's really a match?). I expect it'd be very important
to skip the atomic op when not needed. I don't know if it'd be worth
a new specialisation to avoid the branching... IIRC previously we've
chosen to set the flag in cases that don't need it on the basis that
the branch to decide might be worse than just setting it always (which
wasn't very scientific... it's also not nice to dirty memory for no
reason, hence the check to avoid setting when already set), but now
that we're talking about expensive memory synchronising ops we'd be
forced to reconsider that, with a branch and optionally a
specialisation to constant-fold it away (a technique we only use today
to strip out the branches in the serial vs parallel versions of the
function, pending more research on what other combinations are worth
spending code size on, but this one was already on the candidate list
to look into...).

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message jian he 2025-10-30 04:07:03 Re: BUG #19099: Conditional DELETE from partitioned table with non-updatable partition raises internal error
Previous Message Richard Guo 2025-10-30 03:24:49 Re: [EXTERNAL]Re: BUG #19094: select statement on postgres 17 vs postgres 18 is returning different/duplicate results