Re: Asymmetric partition-wise JOIN

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)heterodb(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Asymmetric partition-wise JOIN
Date: 2019-08-23 22:01:41
Message-ID: CA+hUKGJw6ii_jLgkit1FUF8-cXaBt+RFjC+B3quh8sTFa9ge0g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 23, 2019 at 4:05 AM Kohei KaiGai <kaigai(at)heterodb(dot)com> wrote:
> We can consider the table join ptable X t1 above to be equivalent to:
> (ptable_p0 + ptable_p1 + ptable_p2) X t1
> = (ptable_p0 X t1) + (ptable_p1 X t1) + (ptable_p2 X t1)
> It returns an equivalent result; however, rows are already reduced by the
> HashJoin in each leaf of the Append, so the CPU cycles consumed by the
> Append node can be fewer.
>
> On the other hand, it has a downside: t1 must be read three times and the
> hash table must also be built three times. That increases the expected cost,
> so the planner may not choose the asymmetric partition-wise join plan.
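
For concreteness, the setup the quoted example implies looks roughly like
this in SQL (a sketch only: the column lists are my assumptions, and only
ptable, ptable_p0..p2, t1 and dist are taken from the thread):

  CREATE TABLE ptable (dist int, payload text) PARTITION BY HASH (dist);
  CREATE TABLE ptable_p0 PARTITION OF ptable
    FOR VALUES WITH (MODULUS 3, REMAINDER 0);
  CREATE TABLE ptable_p1 PARTITION OF ptable
    FOR VALUES WITH (MODULUS 3, REMAINDER 1);
  CREATE TABLE ptable_p2 PARTITION OF ptable
    FOR VALUES WITH (MODULUS 3, REMAINDER 2);
  CREATE TABLE t1 (dist int, label text);

  -- The join whose HashJoin the quoted text wants to push below the Append:
  SELECT * FROM ptable p JOIN t1 USING (dist);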

What if you include the partition constraint as a filter on t1? So you get:

ptable X t1 =
(ptable_p0 X (σ hash(dist)%3=0 (t1))) +
(ptable_p1 X (σ hash(dist)%3=1 (t1))) +
(ptable_p2 X (σ hash(dist)%3=2 (t1)))

Pros:
1. The hash tables will not contain unnecessary junk.
2. You'll get the right answer if t1 is on the outer side of an outer join.
3. If this runs underneath a Parallel Append and t1 is big enough
then workers will hopefully cooperate and do a synchronised scan of
t1.
4. The filter might enable a selective and efficient plan like an index scan.

Cons:
1. The filter might not enable a selective and efficient plan, and
therefore cause extra work.

(It's a little weird in this example because you don't usually see hash
functions in WHERE clauses, but that could just as easily be dist
BETWEEN 1 AND 99 or any other partition constraint.)
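
Spelled out for one arm of that expansion, the hand-written rewrite might
look like this (a sketch, assuming the three-partition ptable/t1 setup
above; satisfies_hash_partition() is how PostgreSQL spells a hash
partition's implicit constraint, here modulus 3, remainder 0 for ptable_p0):

  -- Apply ptable_p0's partition constraint to t1 before the hash table is
  -- built, so the hash table only holds rows that could possibly match:
  SELECT *
  FROM ptable_p0 p
  JOIN (SELECT * FROM t1
        WHERE satisfies_hash_partition('ptable'::regclass, 3, 0, t1.dist)) s
    USING (dist);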

> One idea I have is that sibling HashJoins share a hash table that was built
> once by any one of the sibling Hash plans. Right now, it is not implemented yet.

Yeah, I've thought a little bit about that in the context of Parallel
Repartition. I'm interested in combining intra-node partitioning
(where a single plan node repartitions data among workers on the fly)
with inter-node partitioning (like PWJ, where partitions are handled
by different parts of the plan, considered at planning time); you
finish up needing to have nodes in the plan that 'receive' tuples for
each partition, to match up with the PWJ plan structure. That's not
entirely unlike CTE references, and not entirely unlike your idea of
somehow sharing the same hash table. I ran into a number of problems
while thinking about that, which I should write about in another
thread.
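
The CTE analogy can at least be written down today: a single materialisation
of t1 feeding every arm of the Append is roughly what a multiply-referenced
CTE gives you, minus the shared hash table (a sketch, reusing the assumed
ptable/t1 schema from above):

  -- t1 is scanned and materialised once; every arm probes the same tuples,
  -- though each HashJoin still builds its own hash table today.
  WITH t1_once AS (SELECT * FROM t1)
  SELECT * FROM ptable_p0 JOIN t1_once USING (dist)
  UNION ALL
  SELECT * FROM ptable_p1 JOIN t1_once USING (dist)
  UNION ALL
  SELECT * FROM ptable_p2 JOIN t1_once USING (dist);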

--
Thomas Munro
https://enterprisedb.com
