Re: [HACKERS] Parallel Hash take II

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Prabhat Sahu <prabhat(dot)sahu(at)enterprisedb(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: [HACKERS] Parallel Hash take II
Date: 2017-12-14 09:12:38
Message-ID: 20171214091238.goov4tokdr5wfmbk@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

Looking at the main patch (v28).

First off: this looks pretty good. The code's quite readable now
(especially compared to earlier versions), and the comments are good. I
really like the nodeHash split, and the always-inline hackery in
nodeHashjoin. I think we're getting really, really close.

* ExecHashJoinImpl
*
- * This function implements the Hybrid Hashjoin algorithm.
+ * This function implements the Hybrid Hashjoin algorithm. By forcing it
+ * to be always inline many compilers are able to specialize it for
+ * parallel = true/false without repeating the code.
*

What about adding the above explanation for the always inline?
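
To illustrate, something along these lines (just a sketch, assuming a
force-inline attribute like pg_attribute_always_inline; not the exact
patch code):

/*
 * Forcing this to be inlined lets the compiler constant-fold "parallel"
 * and drop the dead branches, i.e. we get two specialized copies of the
 * join loop without duplicating the source code.
 */
static pg_attribute_always_inline TupleTableSlot *
ExecHashJoinImpl(PlanState *pstate, bool parallel)
{
    /* common Hybrid Hashjoin logic, branching on "parallel" where needed */
    return NULL;    /* placeholder for the real join loop */
}

static TupleTableSlot *
ExecHashJoin(PlanState *pstate)
{
    /* constant argument, so the parallel-aware branches are compiled away */
    return ExecHashJoinImpl(pstate, false);
}

(and an ExecParallelHashJoin() wrapper passing true, correspondingly)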

+ /*
+ * So far we have no idea whether there are any other participants,
+ * and if so, what phase they are working on. The only thing we care
+ * about at this point is whether someone has already created the
+ * SharedHashJoinBatch objects, the main hash table for batch 0 and
+ * (if necessary) the skew hash table yet. One backend will be
+ * elected to do that now if necessary.
+ */

The 'yet' sounds a bit weird in combination with the 'already'.

+ static void
+ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
+ ...
+ case PHJ_GROW_BATCHES_ELECTING:
+ /* Elect one participant to prepare the operation. */

That's a good chunk of code. I'm ever so slightly inclined to put that
into a separate function. But I'm not sure it'd look good. Feel entirely
free to disregard.

+ static HashJoinTuple
+ ExecParallelHashLoadTuple(HashJoinTable hashtable, MinimalTuple tuple,
+ dsa_pointer *shared)

Not really happy with the name. ExecParallelHashTableInsert() calling
ExecParallelHashLoadTuple() to insert a tuple into the hashtable doesn't
quite strike me as right; the naming similarity to
ExecParallelHashTableLoad seems problematic too.
ExecParallelHashAllocTuple() or such?

One could argue it'd not be a bad idea to keep a split similar to the one
dense_alloc() and memcpy() have, but I'm not really convinced myself. Hm.
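
For reference, the serial-path split I mean is roughly this (quoting the
existing ExecHashTableInsert() shape from memory, so details may be off):

hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
hashTuple = (HashJoinTuple) dense_alloc(hashtable, hashTupleSize);
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);

i.e. the allocation and the copy stay visible as separate steps in the
caller.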

+ case PHJ_GROW_BATCHES_ELECTING:
+ /* Elect one participant to prepare the operation. */

Imo that comment could use a one-line summary of what preparing means.

+ /*
+ * We probably also need a smaller bucket array. How many
+ * tuples do we expect per batch, assuming we have only
+ * half of them so far?

Makes sense, but it did cost me a minute of thinking. Maybe add a short
explanation of why.

+ case PHJ_GROW_BATCHES_ALLOCATING:
+ /* Wait for the above to be finished. */
+ BarrierArriveAndWait(&pstate->grow_batches_barrier,
+ WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING);
+ /* Fall through. */

Just to make sure I understand: the "empty" phase is there to protect the
electee while it does the sizing calculations etc.? And the reason that's
not possible to do just by waiting for PHJ_GROW_BATCHES_REPARTITIONING is
that somebody could attach dynamically, i.e. it wouldn't be needed with a
statically sized barrier? Pretty tired here, sorry ;)
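
For context, the pattern I'm looking at is roughly this (my reading of it
as a sketch, not the exact patch code):

case PHJ_GROW_BATCHES_ELECTING:
    /* exactly one arriving participant gets true back and does the prep */
    if (BarrierArriveAndWait(&pstate->grow_batches_barrier,
                             WAIT_EVENT_HASH_GROW_BATCHES_ELECTING))
    {
        /* elected backend: size and allocate the new batch arrays here */
    }
    /* Fall through. */

case PHJ_GROW_BATCHES_ALLOCATING:
    /* everybody, including late attachers, waits for that to finish */
    BarrierArriveAndWait(&pstate->grow_batches_barrier,
                         WAIT_EVENT_HASH_GROW_BATCHES_ALLOCATING);
    /* Fall through. */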

+ /* reset temp memory each time to avoid leaks from qual expr */
+ ResetExprContext(econtext);
+
+ if (ExecQual(hjclauses, econtext))

I personally think it's better to avoid this pattern: store the result of
ExecQual() in a variable, call ResetExprContext(), and then act on the
result. There's no need to keep the memory around for longer, and for
bigger contexts you're more likely to still have all the metadata in cache.

I'd previously thought about introducing ExecQualAndReset() or such...
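
Something like this, just to sketch what I mean (name and exact placement
hypothetical):

static inline bool
ExecQualAndReset(ExprState *state, ExprContext *econtext)
{
    bool    ret = ExecQual(state, econtext);

    /* release the qual's per-tuple memory now that we have the result */
    ResetExprContext(econtext);
    return ret;
}

That way callers can't forget the reset, and the context is freed as early
as possible.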

* IDENTIFICATION
* src/backend/executor/nodeHashjoin.c
*
+ * PARALLELISM
+ *

This is a pretty good explanation. How about adding a reference to it
from nodeHash.c's header?

+static TupleTableSlot * /* return: a tuple or NULL */
+ExecHashJoin(PlanState *pstate)
+{
+ /*
+ * On sufficiently smart compilers this should be inlined with the
+ * parallel-aware branches removed.
+ */
+ return ExecHashJoinImpl(pstate, false);

Ah, the explanation I desired above is here. Still seems good to have a
comment at the somewhat suspicious use of always_inline.

+
+ /*
+ * If we started up so late that the shared batches have been freed
+ * already by ExecHashTableDetach(), then we are finished.
+ */
+ if (!DsaPointerIsValid(hashtable->parallel_state->batches))
+ return false;

Is this really the only place that weird condition is detected? And why is
it possible in the first place? And if it is possible, shouldn't we have
detected it earlier? Also, if it is possible, what prevents it from
occurring in a way where that test fails because pstate->batches has been
freed but not yet reset?

+ while ((tuple = sts_parallel_scan_next(hashtable->batches[batchno].inner_tuples,
+ &hashvalue)))

That's a fairly long line. Couldn't this whole block be made more readable
by using a temp variable for inner_tuples?
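
Something like this, perhaps (assuming the accessor type is
SharedTuplestoreAccessor *):

SharedTuplestoreAccessor *inner_tuples;

inner_tuples = hashtable->batches[batchno].inner_tuples;
while ((tuple = sts_parallel_scan_next(inner_tuples, &hashvalue)))
{
    /* insert the tuple into the hash table, as before */
}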

Ok, my eyes are closing against my will. This looks pretty damn close.

- Andres
