Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2020-03-24 18:03:11
Message-ID: CAPmGK14dnW-yhwvCTjC7dACvMWefwASiAKxbuUOxxjdRLcsLAg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Mon, Mar 23, 2020 at 10:42 PM Ashutosh Bapat
<ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:

> > > "partition of the join" looks consistent with other usages than "segment of the
> > > join".
> >
> > Actually, "segment" is used in the existing comments in the caller
> > function try_partitionwise_join(), so I think "segment" is better here
> > for consistency.
>
> A segment can be any part of the join relation, not necessarily a
> partition. May be we should change the caller.

I don't think so. My first language is not English, though. I don't
think this should be a blocker, so how about leaving this for another
patch?

> > > + /*
> > > + * Get a relids set of partition(s) involved in this join segment that
> > > + * are from the rel1 side.
> > > + */
> > > + child_relids1 = bms_intersect(child_joinrel->relids,
> > > + rel1->all_partrels);
> > >
> > > The partition bounds are sorted by their values. Even for a list partitioned
> > > table, the bounds are sorted by the least partition value. We do not allow
> > > multiple paritions from one side to be joined with one partition on the other
> > > and vice versa. All this together means that the partitions of the join
> > > relation are formed by joining partitions from either side in the order of
> > > their bounds. This means that the matching pairs of partitions can be found by
> > > matching relids of partitions of join with those of the joining relation by
> > > traversing partitions from all the three relations only once in the order they
> > > appears in partition bounds of corresponding relations.
> >
> > Consider this 2-way join for list partitioned tables:

> Hmm, this is a good example. I tried to work out the algorithm based
> on the bound ordering. The algorithm worked well when all the bounds
> on both the sides were included in the join. But it didn't work well,
> when some bounds vanished. In order to detect whether a bound has
> vanished, we need to either compare that bound with the bounds of join
> (an operation costlier than comparing bitmapset) or compare relids of
> all the partitions of the join. Either way it looks costlier than what
> you have right now. May be we could improve by keeping track of such
> lost bounds and corresponding partitions. But I didn't get time to
> work on that part. Anyway, even if such an algorithm exists, we will
> have to change just a single function. That could be done later I
> think. So we are good here right now. Thanks.

OK, there is always room for improvement.

> > > + if (rel1_is_simple)
> > >
> > > This variable is used only in one place. So instead we should the expression
> > > assigning the value to it. Changed in the attached patch.
> >
> > I don't think that's a good idea, because this check is done
> > repeatedly in a for loop.
>
> Compiler's optimizer would anyway optimize it away. But anyway, I
> won't insist on this.

OK

> > > + case PARTITION_STRATEGY_HASH:
> > > + merged_bounds = NULL;
> > >
> > > I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
> > > the reason is thus: When the modulo of used for partition mapping (i.e. maximum
> > > number of has partitions) is same, their partition bounds are same and do not
> > > need merging.
> >
> > I don't think that's always true; there are cases where the moduli are
> > the same, but the partition bounds are not, because it's possible to
> > only define partitions for some of the remainders. See the discussion
> > in [1].
>
> Hmm, but that case would be rare, IMO. It's an artifact of the way our
> hash partitioning syntax is, and does not reflect a real world
> scenario.

I agree on that point.

> > > If the maximum number of partitions is different for both the
> > > joining relations, there's high probability that one partition on one side will
> > > join with multiple partitions on the other side. So exact partition bounds
> > > match will work in most of the cases. The cases otherwise are not so common to
> > > spend the effort in coding and planning.
> > >
> > > I have added this explanation in the patch.

> > I'd like to propose to add
> > something like this, instead: "Currently we support partitionwise join
> > for hash partitioned tables only when the partition bounds for them
> > exactly match, but later it might be worth the effort to relax the
> > restriction."
>
> That's good too. But please include an explanation about the case when
> modulo/max no. of partitions itself differs. That case is not likely
> to get addressed in nearer future.

OK, I added comments, including that explanation. Please find
attached a new version of the patch.

On Tue, Mar 17, 2020 at 5:14 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
> On Wed, Mar 4, 2020 at 1:48 AM Ashutosh Bapat
> <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> wrote:
> > + * We can not perform partition-wise join if either of the joining
> > + * relations is not partitioned.
> >
> > We are consistently using partitionwise instead of partition-wise.
>
> Will fix.

Done.

> > + /*
> > + * If the partition bounds for the join rel are not merged ones,
> > + * inputs are guaranteed to have the same partition bounds, so
> > + * partitions with the same partition indexes will form join pairs;
> > + * else let get_matching_part_pairs() do the work.
> > + */
> > + if (joinrel->merged)
> > + {
> >
> > This condition in the comment is opposite to the condition being checked in
> > code, so looks confusing. BTW this comment is also repeated around line 1405.
> > See attached patch for correction.
>
> OK, I'll revise the comments as proposed.

I updated this based on what you suggested.

> > - rel->nparts = 0;
> > + rel->nparts = -1;
> >
> > I think we need to add comments around various values that nparts can take. How
> > about like something attached.
>
> +1

- int nparts; /* number of partitions */
+ int nparts; /* number of partitions.
+ * 0 for a relation with no partitions,
+ * > 0 indicates actual number of partitions
+ * -1 for a relation whose number of partitions
+ * is not yet known.
+ */

I don't think the comment "0 for a relation with no partitions" is
correct. I think 0 means the relation is considered unpartitioned, so
I modified the comments as such. I also changed the format to match
other places in pathnodes.h.

> > +/*
> > + * get_merged_range_bounds
> > + * Given the bounds of range partitions to be join, determine the range
> >
> > s/join/joined/
>
> Good catch! Will fix.

Done.

> > There are more changes to comments, where I thought that the comments are
> > required or existing comments need more clarification. Please review the
> > attached patch.
>
> I will review the patch ASAP.

I looked into the patch.

@@ -4093,8 +4103,13 @@ map_and_merge_partitions(PartitionMap
*outer_map, PartitionMap *inner_map,

- * Note that we will fix the larger index that have been added to
- * the merged_indexes list so far in fix_merged_indexes().
+ * Both the inner and outer partitions have an empty partition on
+ * the other side as their joining partner. But now that each of
+ * them has found a non-empty joining partner we should re-map
+ * those to a single partition in the join. We use lower of the
+ * two indexes to avoid any holes being created by re-mapping.
+ * Also, it keeps the partition index array in partition bounds
+ * roughly sorted.

OK, but I modified this further.

+ * The caller thinks that the partition at the given index does not have a
+ * partition in the other relation or the joining partition is empty. In such a
+ * case we assign a temporary index (indicated by merged flag in the map) for
+ * the resulting partition in the join. In case the given partition finds a
+ * non-empty partner latter we will adjust the mapping again.

Done. I modified this a bit, though.

@@ -4590,6 +4620,13 @@ merge_default_partitions(PartitionMap *outer_map,

+ * We should have already given up if we found that both the inner and
+ * outer relations have default partitions and either of them had a
+ * partition without a matching non-default partition on the other
+ * side. See process_outer_partition() and process_inner_partition()
+ * for details.

This seems to me a bit difficult to read. How about something simpler
in the attached?

I also updated comments in process_outer_partition() and
process_inner_partition(), almost as proposed in your patch.

I polished the patch further. Changes are:

* I removed the arguments parttyplen and parttypbyval for
partition_bounds_merge(), because we don't need them anymore. Also, I
removed these assertions from that function:

+ Assert(merged_bounds || (*outer_parts == NIL && *inner_parts == NIL));
+
+ Assert(list_length(*outer_parts) == list_length(*inner_parts));

because the first assertion isn't correct as we don't allow non-NULL
merged_bounds to have outer_parts=NIL and inner_parts=NIL, and because
the second assertion is redundant as we have the same in
partition_list_bounds_merge() and partition_range_bounds_merge(). I
modified assertions in these functions a bit, though.

* I renamed partition_list_bounds_merge() and
partition_range_bounds_merge() to merge_list_bounds() and
merge_range_bounds(), analogously to create_list_bounds() and
create_range_bounds() in partition_bounds_create(). Does that make
sense?

* In the old versions of the patch, map_and_merge_partitions() checked
whether it was possible to "map" one of given partitions to the other,
but that function doesn't do that anymore. That function assumes that
given partitions match each other, so I don't think the name is good
anymore. How about renaming that function to
merge_matching_partitions() or something like that?

* I think the names of variables used in functions added to
partbounds.c are inconsistent. For example, the names of the outer
and inner partition indexes in partition_list_bounds_merge() are
"o_index" and "i_index", while those in partition_range_bounds_merge()
are "outer_part" and "inner_part". IMO such inconsistencies would
make the code a bit difficult to read, so I modified the names of
those variables to be more consistent between those functions.

* I modified the code in merge_null_partitions() to match other
functions added to partbounds.c, but no functional changes. Also, I
think the names "outer_null_unmatched" and "inner_null_unmatched" are
a bit misleading, because I think the reader might incorrectly think
both the NULL partitions exist, but that isn't always true, so I
renamed them to what I think would be better.

* I added/tweaked comments a lot, fixing (probably my) typos and
grammatical errors.

* I did some cleanup, including fixing some format issues.

* I sorted functions added to partbounds.c in a more logical order,
and moved them to a more appropriate place in that source file.

* I added/modified some regression tests. I added/modified comments a
bit as well. One thing I should mention is a change to this:

+-- 3-way join where not every pair of relations can do partitioned
join
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t2.a, t3.c FROM prt1_ad t1 RIGHT JOIN prt2_ad t2 ON
(t1.a = t2.b) INNER JOIN prt1_ad t3 ON (t2.b = t3.a) WHERE t1.b = 0
ORDER BY t1.a, t2.a, t3.c;

The RIGHT join will be transformed to an INNER join by outer join
reduction IIUC, so this wouldn't be what was originally intended. I
think this is my fault; I modified the original test case incorrectly
when I updated all the regression tests for this feature [1]. Sorry
for that. I fixed this issue.

Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/CAPmGK16LsKXX%3DYYzc-PNiY6aaYApg1Gmkc6A14dnJsrBBmgd0g%40mail.gmail.com

Attachment Content-Type Size
v33-0001-Improve-partition-matching-for-partitionwise-join.patch application/octet-stream 226.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2020-03-24 18:04:24 Re: backup manifests
Previous Message Tom Lane 2020-03-24 17:57:43 Re: [PATCH] Implement INSERT SET syntax