Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: amul sul <sulamul(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2019-09-24 15:59:57
Message-ID: CAPmGK14WHKckT1P6UJV2B63TZAxPyMn8iZJ99XF=AZuNhG6vow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Amul,

On Mon, Sep 2, 2019 at 2:08 PM amul sul <sulamul(at)gmail(dot)com> wrote:
> Please find my comments inline below:

Thank you for the review!

> On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
>> On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:

> On thinking further, a thought hits to me why we can't join two hash partitioned
> table which has the same modulus and partition key specification, but some of
> the partitions are missing from either partitioned table.
>
> For e.g. here is a smaller case where foo has two partitions and bar has only one.
>
> CREATE TABLE foo(a int) PARTITION BY HASH(a);
> CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
> CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);
>
> CREATE TABLE bar(a int) PARTITION BY HASH(a); <-- missing partitions for REMAINDER 1
> CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);
>
> Explain:
> postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
> QUERY PLAN
> ---------------------------------------------------------------------------------
> Merge Join (cost=590.35..1578.47 rows=65025 width=8)
> Merge Cond: (p2.a = p1.a)
> -> Sort (cost=179.78..186.16 rows=2550 width=4)
> Sort Key: p2.a
> -> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
> -> Sort (cost=410.57..423.32 rows=5100 width=4)
> Sort Key: p1.a
> -> Append (cost=0.00..96.50 rows=5100 width=4)
> -> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
> -> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
> (10 rows)
>
> The partitions-wise join will be performed only if we fill the partition hole that
> exists for the joining table i.e. adding partitions to bar table.
>
> postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
> CREATE TABLE
> postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
> QUERY PLAN
> ---------------------------------------------------------------------------------
> Append (cost=359.57..2045.11 rows=65024 width=8)
> -> Merge Join (cost=359.57..860.00 rows=32512 width=8)
> Merge Cond: (p1.a = p2.a)
> -> Sort (cost=179.78..186.16 rows=2550 width=4)
> Sort Key: p1.a
> -> Seq Scan on foo_p0 p1 (cost=0.00..35.50 rows=2550 width=4)
> -> Sort (cost=179.78..186.16 rows=2550 width=4)
> Sort Key: p2.a
> -> Seq Scan on bar_p0 p2 (cost=0.00..35.50 rows=2550 width=4)
> -> Merge Join (cost=359.57..860.00 rows=32512 width=8)
> Merge Cond: (p1_1.a = p2_1.a)
> -> Sort (cost=179.78..186.16 rows=2550 width=4)
> Sort Key: p1_1.a
> -> Seq Scan on foo_p1 p1_1 (cost=0.00..35.50 rows=2550 width=4)
> -> Sort (cost=179.78..186.16 rows=2550 width=4)
> Sort Key: p2_1.a
> -> Seq Scan on bar_p1 p2_1 (cost=0.00..35.50 rows=2550 width=4)
> (17 rows)
>
> It would have been nice if we could support this case, as we do allow partitions
> hole for the other partition scheme, but there wouldn't be much objection if
> we don't want to add this support for now since there will be a lesser chance
> that hash partitioned table has the hole, IMO.

I agree with you on that point.

>> If there are no objections, I'll merge the attached with the base patch in [1].

> The proposed enhancement in the patch is too good and the patch is pretty much
> reasonable to merge into the main patch.

Done. Attached is a merged version of the patch.

> Here are the few cosmetic fixes for this path I think is needed. Feel free to
> ignore if some of them do not make sense or too obvious.
>
> Note: left side number represents code line number of the patch.
>
> 118 + }
> 119 +
> 120 + /*
> 121 + * Try to create the partition bounds along with join pairs.
> 122 + */
> 123 + if (boundinfo == NULL)
> 124 + {
>
> Can we add this block as else part of previous if condition checking equal partitions bound?

Done.

> 133 + Assert(list_length(parts1) == list_length(parts2));
> 134 + if (boundinfo == NULL)
> 135 + {
> 136 + joinrel->nparts = 0;
> 137 + return;
> 138 + }
> 139 + nparts = list_length(parts1);

> And the question is do we need to do that assert check since partition_bounds_merge()
> does that just before returning, thoughts?

You are right; that assertion would be redundant, so I removed that.

> 204 + RelOptInfo *child_rel1 = merged ? (RelOptInfo *) lfirst(lcr1) : rel1->part_rels[cnt_parts];
> 205 + RelOptInfo *child_rel2 = merged ? (RelOptInfo *) lfirst(lcr2) : rel2->part_rels[cnt_parts];
>
> How about using lfirst_node instead of lfirst & casting explicitly?
>
> Also, these lines crossing 80 column length which I think we need to fix. How about
> doing the assignment as follow, just after the variable declaration part:
>
> if (merged)
> {
> child_rel1 = lfirst_node(lcr1, RelOptInfo);
> child_rel2 = lfirst_node(lcr2, RelOptInfo);
> lcr1 = lnext(parts1, lcr1);
> lcr2 = lnext(parts2, lcr2);
> }
> else
> {
> child_rel1 = rel1->part_rels[cnt_parts];
> child_rel2 = rel2->part_rels[cnt_parts]
> }
>
> rel1_empty = (child_rel1 == NULL || IS_DUMMY_REL(child_rel1));
> rel2_empty = (child_rel2 == NULL || IS_DUMMY_REL(child_rel2));

Done. (I don't think the order of arguments for first_node() above is
correct; the first argument for that should be RelOptInfo.)

> 266 + * get_matching_part_pairs
> 267 + * Generate join pairs of partitions for the two inputs for the join rel
>
> Can we rewrite this description as " Generate join pairs of partitions for the
> join rel from the two inputs." OR "Generate join pairs of partitions for the
> two inputs"

Done.

> 310 + Assert(bms_num_members(child_relids1) == bms_num_members(rel1->relids));
> 311 + /*
>
> 335 + Assert(bms_num_members(child_relids2) == bms_num_members(rel2->relids));
> 336 + /*
>
> Need newline after assert statements.

Done.

Other change: I guess this in partbounds.c would be leftovers from an
older version of the patch, so I removed it.

+/*
+ * Allocate and initialize partition maps. We maintain four maps, two maps
+ * for each joining relation. pmap[i] gives the partition from the other
+ * relation which would join with ith partition of the given relation.
+ * Partition i from the given relation will join with partition pmap[i]
+ * from the other relation to produce partition mmap[i] of the join (merged
+ * partition).
+ *
+ * pmap[i] = -1 indicates that ith partition of a given relation does not
+ * have a matching partition from the other relation.
+ *
+ * mmap[i] = -1 indicates that ith partition of a given relation does not
+ * contribute to the join result. That can happen only when the given
+ * relation is the inner relation and it doesn't have a matching partition
+ * from the outer relation, hence pmap[i] should be -1.
+ *
+ * In case of an outer join, every partition of the outer join will appear
+ * in the join result, and thus has mmap[i] set for all i. But it's not
+ * necessary that every partition on the outer side will have a matching
+ * partition on the inner side. In such a case, we end up with pmap[i] = -1
+ * and mmap[i] != -1.
+ */

I will continue to review the rest of the patch.

I am sorry for the long delay.

Best regards,
Etsuro Fujita

Attachment Content-Type Size
Improve-partition-matching-for-partitionwise-joins-v24.patch application/octet-stream 287.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-09-24 16:01:42 Re: Unwanted expression simplification in PG12b2
Previous Message vignesh C 2019-09-24 15:51:20 Re: dropdb --force