Re: Parallel Full Hash Join

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Greg Nancarrow <gregn4422(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Full Hash Join
Date: 2022-01-11 21:30:37
Message-ID: CAAKRu_YX8PdxHbqV4wEk6v-ivMTwN+tcYG8AtTe3HT3gz=ewmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Nov 26, 2021 at 3:11 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> On Sun, Nov 21, 2021 at 4:48 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > On Wed, Nov 17, 2021 at 01:45:06PM -0500, Melanie Plageman wrote:
> > > Yes, this looks like that issue.
> > >
> > > I've attached a v8 set with the fix I suggested in [1] included.
> > > (I added it to 0001).
> >
> > This is still crashing :(
> > https://cirrus-ci.com/task/6738329224871936
> > https://cirrus-ci.com/task/4895130286030848
>
> I added a core file backtrace to cfbot's CI recipe a few days ago, so
> now we have:
>
> https://cirrus-ci.com/task/5676480098205696
>
> #3 0x00000000009cf57e in ExceptionalCondition (conditionName=0x29cae8
> "BarrierParticipants(&accessor->shared->batch_barrier) == 1",
> errorType=<optimized out>, fileName=0x2ae561 "nodeHash.c",
> lineNumber=lineNumber(at)entry=2224) at assert.c:69
> No locals.
> #4 0x000000000071575e in ExecParallelScanHashTableForUnmatched
> (hjstate=hjstate(at)entry=0x80a60a3c8,
> econtext=econtext(at)entry=0x80a60ae98) at nodeHash.c:2224

I believe this assert can be safely removed.

It is possible for a worker to attach to the batch barrier after the
"last" worker was elected to scan and emit unmatched inner tuples. This
is safe because the batch barrier is already in phase PHJ_BATCH_SCAN
and this newly attached worker will simply detach from the batch
barrier and look for a new batch to work on.

The order of events would be as follows:

W1: advances batch to PHJ_BATCH_SCAN
W2: attaches to batch barrier in ExecParallelHashJoinNewBatch()
W1: calls ExecParallelScanHashTableForUnmatched() (2 workers attached to
barrier at this point)
W2: detaches from the batch barrier

The attached v10 patch removes this assert and updates the comment in
ExecParallelScanHashTableForUnmatched().

I'm not sure if I should add more detail about this scenario in
ExecParallelHashJoinNewBatch() under PHJ_BATCH_SCAN or if the detail in
ExecParallelScanHashTableForUnmatched() is sufficient.

- Melanie

Attachment Content-Type Size
v10-0002-Improve-the-naming-of-Parallel-Hash-Join-phases.patch text/x-patch 25.3 KB
v10-0001-Fix-race-condition-in-parallel-hash-join-batch-c.patch text/x-patch 11.0 KB
v10-0003-Parallel-Hash-Full-Right-Outer-Join.patch text/x-patch 24.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2022-01-11 21:35:41 Re: Windows crash / abort handling
Previous Message Alvaro Herrera 2022-01-11 21:30:11 Re: row filtering for logical replication