Re: Parallel Full Hash Join

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel Full Hash Join
Date: 2021-04-06 20:56:19
Message-ID: CALNJ-vS5GELo_=kJ7wNjKVPe5qWLXGfLABTC8hhCW1Y3Ls9BJw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 6, 2021 at 11:59 AM Melanie Plageman <melanieplageman(at)gmail(dot)com>
wrote:

> On Fri, Apr 2, 2021 at 3:06 PM Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:
> >
> > Hi,
> > For v6-0003-Parallel-Hash-Full-Right-Outer-Join.patch
> >
> > + * current_chunk_idx: index in current HashMemoryChunk
> >
> > The above comment seems to be a better fit for
> > ExecScanHashTableForUnmatched() instead of
> > ExecParallelPrepHashTableForUnmatched().
> > I wonder where current_chunk_idx should belong (considering the above
> > comment and what the code does).
> >
> > +    while (hashtable->current_chunk_idx < hashtable->current_chunk->used)
> > ...
> > +    next = hashtable->current_chunk->next.unshared;
> > +    hashtable->current_chunk = next;
> > +    hashtable->current_chunk_idx = 0;
> >
> > Each time we advance to the next chunk, current_chunk_idx is reset. It
> > seems current_chunk_idx could be placed inside the chunk.
> > Maybe the consideration is that, with the current arrangement, we save
> > space by keeping the current_chunk_idx field at a higher level.
> > If that is the case, a comment should be added.
> >
>
> Thank you for the review. I think that moving the current_chunk_idx into
> the HashMemoryChunk would probably take up too much space.
>
> In other places where we loop through the tuples in a chunk, we are able
> to just keep a local idx, like here in
> ExecParallelHashIncreaseNumBuckets():
>
>     case PHJ_GROW_BUCKETS_REINSERTING:
>         ...
>         while ((chunk = ExecParallelHashPopChunkQueue(hashtable, &chunk_s)))
>         {
>             size_t  idx = 0;
>
>             while (idx < chunk->used)
>
> but, since we cannot do that while also emitting tuples, I thought,
> let's just stash the index in the hashtable for use in serial hash join
> and the batch accessor for parallel hash join. A comment to this effect
> sounds good to me.
>
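
Thanks for the explanation. To make sure I follow the trade-off, here is a
stand-alone sketch of the two scan shapes (Chunk and Table below are only
stand-ins for HashMemoryChunk and HashJoinTable; this is not the patch code,
and advancing by the actual tuple size is elided):

/*
 * Stand-alone sketch of the two scan shapes discussed above.  Chunk and
 * Table stand in for HashMemoryChunk and HashJoinTable; none of this is
 * the actual patch code.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct Chunk
{
    struct Chunk *next;
    size_t        used;          /* amount of tuple data stored in the chunk */
} Chunk;

typedef struct Table
{
    Chunk  *chunks;              /* head of the chunk list */

    /* scan position that must survive between calls while emitting tuples */
    Chunk  *current_chunk;
    size_t  current_chunk_idx;
} Table;

/*
 * One-shot pass: the whole chunk list is consumed in a single call, so a
 * local index (as in the PHJ_GROW_BUCKETS_REINSERTING loop quoted above)
 * is sufficient.
 */
static void
one_shot_pass(Table *table)
{
    for (Chunk *chunk = table->chunks; chunk != NULL; chunk = chunk->next)
    {
        size_t  idx = 0;

        while (idx < chunk->used)
        {
            /* ... process the tuple at idx ... */
            idx++;               /* real code advances by the tuple's size */
        }
    }
}

/*
 * Resumable pass: control returns to the caller for every emitted tuple,
 * so the position lives in the table (or, for parallel hash join, in the
 * batch accessor) instead of in a local variable.
 */
static bool
resumable_pass(Table *table)
{
    while (table->current_chunk != NULL)
    {
        while (table->current_chunk_idx < table->current_chunk->used)
        {
            table->current_chunk_idx++;  /* real code advances by tuple size */
            return true;         /* pretend the tuple was unmatched: emit it */
        }
        table->current_chunk = table->current_chunk->next;
        table->current_chunk_idx = 0;
    }
    return false;                /* scan complete */
}

The second shape is the one that needs the stashed index, since every emitted
tuple hands control back to the caller.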

Given the way HashJoinTable is used, I don't have a better idea w.r.t. the
location of current_chunk_idx.
It is not worth introducing another level of mapping between HashJoinTable
and the chunk index.

So the current arrangement is fine with an additional comment
in ParallelHashJoinBatchAccessor (the current comment doesn't explicitly
mention current_chunk_idx).
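
Something along these lines would be enough, I think (illustrative wording
only -- the actual field name and placement are whatever the patch uses):

    /*
     * current_chunk_idx: position of the unmatched-tuple scan within the
     * chunk currently being scanned.  Kept in the hashtable (serial hash
     * join) or in the batch accessor (parallel hash join) rather than in
     * HashMemoryChunk, to avoid growing every chunk; reset to 0 whenever
     * the scan advances to the next chunk.
     */
    size_t      current_chunk_idx;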

Cheers
