Re: Parallel Hash take II

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)bowt(dot)ie>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Oleg Golovanov <rentech(at)mail(dot)ru>
Subject: Re: Parallel Hash take II
Date: 2017-08-01 01:11:39
Message-ID: 20170801011139.pgu3v5tk26t5xeon@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date: Wed 26 Jul 2017 19:58:20 NZST
Subject: [PATCH] Add support for parallel-aware hash joins.

Hi,

WRT the main patch:

- Echoing concerns from other threads (Robert: ping): I'm doubtful that
it makes sense to size the number of parallel workers solely based on
the parallel scan node's size. I don't think it's this patch's job to
change that, but to me it seriously amplifys that - I'd bet there's a
lot of cases with nontrivial joins where the benefit from parallelism
on the join level is bigger than on the scan level itself. And the
number of rows in the upper nodes might also be bigger than on the
scan node level, making it more important to have higher number of
nodes.

- If I understand the code in initial_cost_hashjoin() correctly, we
count the synchronization overhead once, independent of the number of
workers. But on the other hand we calculate the throughput by
dividing by the number of workers. Do you think that's right?

- I haven't really grokked the deadlock issue you address. Could you
expand the comments on that? Possibly somewhere central referenced by
the various parts.

- maybe I'm overly paranoid, but it might not be bad to add some extra
checks for ExecReScanHashJoin ensuring that it doesn't get called when
workers are still doing something.

- seems like you're dereffing tuple unnecessarily here:

+ /*
+ * If we detached a chain of tuples, transfer them to the main hash table
+ * or batch storage.
+ */
+ if (regainable_space > 0)
+ {
+ HashJoinTuple tuple;
+
+ tuple = (HashJoinTuple)
+ dsa_get_address(hashtable->area, detached_chain_shared);
+ ExecHashTransferSkewTuples(hashtable, detached_chain,
+ detached_chain_shared);
+
+ /* Remove from the total space used. */
+ LWLockAcquire(&hashtable->shared->chunk_lock, LW_EXCLUSIVE);
+ Assert(hashtable->shared->size >= regainable_space);
+ hashtable->shared->size -= regainable_space;
+ LWLockRelease(&hashtable->shared->chunk_lock);
+
+ /*
+ * If the bucket we removed is the same as the bucket the caller just
+ * overflowed, then we can forget about the overflowing part of the
+ * tuple. It's been moved out of the skew hash table. Otherwise, the
+ * caller will call again; eventually we'll either succeed in
+ * allocating space for the overflow or reach this case.
+ */
+ if (bucket_to_remove == bucketno)
+ {
+ hashtable->spaceUsedSkew = 0;
+ hashtable->spaceAllowedSkew = 0;
+ }
+ }

- The names here could probably improved some:
+ case WAIT_EVENT_HASH_SHRINKING1:
+ event_name = "Hash/Shrinking1";
+ break;
+ case WAIT_EVENT_HASH_SHRINKING2:
+ event_name = "Hash/Shrinking2";
+ break;
+ case WAIT_EVENT_HASH_SHRINKING3:
+ event_name = "Hash/Shrinking3";
+ break;
+ case WAIT_EVENT_HASH_SHRINKING4:
+ event_name = "Hash/Shrinking4";

- why are we restricting rows_total bit to parallel aware?

+ /*
+ * If parallel-aware, the executor will also need an estimate of the total
+ * number of rows expected from all participants so that it can size the
+ * shared hash table.
+ */
+ if (best_path->jpath.path.parallel_aware)
+ {
+ hash_plan->plan.parallel_aware = true;
+ hash_plan->rows_total = best_path->inner_rows_total;
+ }
+

- seems we need a few more test - I don't think the existing tests are
properly going to exercise the skew stuff, multiple batches, etc?
This is nontrivial code, I'd really like to see a high test coverage
of the new code.

- might not hurt to reindent before the final submission

- Unsurprisingly, please implement the FIXME ;)

Regards,

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-08-01 01:14:07 Re: BUG #14758: Segfault with logical replication on a function index
Previous Message David G. Johnston 2017-08-01 01:10:07 Re: BUG #14759: insert into foreign data partitions fail