Re: Parallel Full Hash Join

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Full Hash Join
Date: 2023-04-08 18:19:54
Message-ID: CAAKRu_Yvpd08MvXO3yeevmXeqykw+Fbh02VVefWuvqKyg6j8iQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Apr 8, 2023 at 1:30 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Sat, Apr 8, 2023 at 12:33 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > > I committed the main patch.
> >
> > BTW, it was easy to miss in all the buildfarm noise from
> > last-possible-minute patches, but chimaera just showed something
> > that looks like a bug in this code [1]:
> >
> > 2023-04-08 12:25:28.709 UTC [18027:321] pg_regress/join_hash LOG: statement: savepoint settings;
> > 2023-04-08 12:25:28.709 UTC [18027:322] pg_regress/join_hash LOG: statement: set local max_parallel_workers_per_gather = 2;
> > 2023-04-08 12:25:28.710 UTC [18027:323] pg_regress/join_hash LOG: statement: explain (costs off)
> > select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
> > 2023-04-08 12:25:28.710 UTC [18027:324] pg_regress/join_hash LOG: statement: select count(*) from simple r full outer join simple s on (r.id = 0 - s.id);
> > TRAP: failed Assert("BarrierParticipants(&batch->batch_barrier) == 1"), File: "nodeHash.c", Line: 2118, PID: 19147

So, after staring at this for awhile, I suspect this assertion is just
plain wrong. BarrierArriveAndDetachExceptLast() contains this code:

if (barrier->participants > 1)
{
--barrier->participants;
SpinLockRelease(&barrier->mutex);

return false;
}
Assert(barrier->participants == 1);

So in between this assertion and the one we tripped,

if (!BarrierArriveAndDetachExceptLast(&batch->batch_barrier))
{
...
return false;
}

/* Now we are alone with this batch. */
Assert(BarrierPhase(&batch->batch_barrier) == PHJ_BATCH_SCAN);
Assert(BarrierParticipants(&batch->batch_barrier) == 1);

Another worker attached to the batch barrier, saw that it was in
PHJ_BATCH_SCAN, marked it done and detached. This is fine.
BarrierArriveAndDetachExceptLast() is meant to ensure no one waits
(deadlock hazard) and that at least one worker stays to do the unmatched
scan. It doesn't hurt anything for another worker to join and find out
there is no work to do.

We should simply delete this assertion.

- Melanie

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2023-04-08 18:40:38 Re: longfin missing gssapi_ext.h
Previous Message Andres Freund 2023-04-08 18:08:16 Re: Direct I/O