Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2020-04-03 15:15:36
Message-ID: CAPmGK16ik+90SN443o7OwE+Rs+i9vh8v2FiJ+61ipE1E7pbigQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Thu, Apr 2, 2020 at 2:12 AM Ashutosh Bapat
<ashutosh(dot)bapat(at)2ndquadrant(dot)com> wrote:
> On Thu, 26 Mar 2020 at 00:35, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> I've started reviewing the patch a couple days ago. I haven't done any
>> extensive testing, but I do have a bunch of initial comments that I can
>> share now.
>>
>> 1) I wonder if this needs to update src/backend/optimizer/README, which
>> does have a section about partitionwise joins. It seems formulated in a
>> way that that probably covers even this more advanced algorithm, but
>> maybe it should mention handling of default partitions etc.?

> Done. Please check the wording. It might need some word smithy.

You heavily changed the existing documentation about PWJ, but I don't
think we really need to do so. Also, IMO I think the description
about default partitions you added is needed in README. I think it
would be better to put such a description in source files. How about
something like the attached, instead? I wrote part of this based on
the commit message in the original versions of the patch you posted.

>> There certainly needs to be some description of the algorithm somewhere,
>> either in a README or before a suitable function. It doesn't have to be
>> particularly detailed, a rough outline of the matching would be enough,
>> so that readers don't have to rebuild the knowledge from pieces
>> scattered around various comments.

> The algorithm for list and range partitioned tables is slightly different. So, I have added separate prologue to each list_merge_bounds() and range_merge_bounds(). Please check if that serves the purpose.

Too detailed to me. In this:

+ * If there are multiple partitions from one side matching a given partition on
+ * the other side, the algorithm bails out since we do not have machinary for
+ * joining one partition with mulitple partitions. It might happen that any of
+ * the list items of a partition from the outer relation do not appear in the
+ * inner relation and there is no default partition in the inner relation. Such
+ * a partition from the outer side will have no matching partition on the inner
+ * side. The algorithm will bail out in such a case since we do not have a
+ * mechanism to perform a join with a non-existing relation.

I don't think the last comment is correct; that would apply to the old
versions of this function IIRC, but not to the latest version. How
about something much simpler like the attached, instead?

>> 2) Do we need another GUC enabling this more complex algorithm? In PG11
>> the partitionwise join is disabled by default, on the grounds that it's
>> expensive and not worth it by default. How much more expensive is this?
>> Maybe it makes sense to allow enabling only the "simple" approach?

> We have reduced the complexity of merging bounds quite a bit so this shouldn't be costly. Further more we handle the usual case of equal bounds quickly using the merge flag so most of the cases should be fine. It's only when two partitioned tables with same partition scheme are joined but do not have merge-able bounds that this algorithm would not yield useful result - but that would be rare in the field. enable_partitionwise_join = false should take care of such scenarios easily. I am not in favour of adding another GUC which we set to false by default and then take another few releases to make it true by default.

I agree with Ashutosh.

>> 3) This comment in try_partitionwise_join is now incorrect, because the
>> condition may be true even for partitioned tables with (nparts == 0).
>>
>> /* Nothing to do, if the join relation is not partitioned. */
>> if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>> return;

> If part_scheme = NULL, npart should be 0, fixed that in build_joinrel_partition_info(). If partscheme != NULL but bounds can not be merged, nparts = 0. So this condition is correct. Encapsulated in a macro IS_JOINREL_NOT_PARTITITIONED(). and added comments for the macro. Given that the macro is used exactly at one place, it may not be necessary to define it but it looks *nice*.

I don't think it would be a good idea to add such a macro for only one place.

>> Moreover, the condition used to be
>>
>> if (!IS_PARTITIONED_REL(joinrel))
>> return;
>>
>> which is way more readable. I think it's net negative to replace these
>> "nice" macros with clear meaning with complex conditions. If needed, we
>> can invent new macros. There are many other places where the patch
>> replaces macros with less readable conditions.

> The only other place where we have replaced a *nice* macro is in build_joinrel_partition_info(). But I think it's a legit replacement. I have added a comment there.

IIUC, this is the existing issue, so I think it would be better to
leave this for another improvement patch.

>> 6) The try_partitionwise_join function is getting a bit too long and
>> harder to understand. The whole block in
>>
>> if (joinrel->nparts == -1)
>> {
>> ...
>> }
>>
>> seems rather well isolated, so I propose to move it to a separate
>> function responsible only for the merging. We can simply call it on the
>> joinrel, and make it return right away if (joinrel->nparts == -1).

That's a good idea, so +1.

>> 7) I'd suggest not to reference exact functions in comments unless
>> abolutely necessary, because it's harder to maintain and it does not
>> really explain purpose of the struct/code. E.g. consider this:
>>
>> /* Per-partitioned-relation data for merge_list_bounds()/merge_range_bounds() */
>> typedef struct PartitionMap
>> { ... }
>>
>> Why does it matter where is the struct used? That's pretty trivial to
>> find using 'git grep' or something. Instead the comment should explain
>> the purpose of the struct.

> Adjusted names and comments a bit.

I modified the comments a bit further. I don't think we need to
change the name of a variable, so I kept it as is.

On Thu, Apr 2, 2020 at 1:51 AM Ashutosh Bapat
<ashutosh(dot)bapat(at)2ndquadrant(dot)com> wrote:
> On Thu, 26 Mar 2020 at 05:47, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> three more comments after eye-balling the code for a bit longer.
>>
>> 1) The patch probably needs to tweak config.sgml which says this about
>> the enable_partitionwise_join GUC:
>>
>> .. Partitionwise join currently applies only when the join conditions
>> include all the partition keys, which must be of the same data type
>> and have exactly matching sets of child partitions. ..

> Done. Actually this wasn't updated when partition pruning was introduced, which could cause a partitionwise join to be not used even when those conditions were met. Similarly when a query involved whole row reference. It's hard to explain all the conditions under which partitionwise join technique will be used. But I have tried to keep it easy to understand.

IMO I think your words "there is exactly one pair of matching
partitions." is a bit misleading, because that sounds like that PWJ
doesn't allow multiply-segmented join. How about s/exact
matching/one-to-one matching/ in the existing documentation, instead?

>> 2) Do we really need the 'merged' flag in try_partitionwise_join? Can't
>> we simply use the joinrel->merged flag directly? ISTM the we always
>> start with joinrel->merged=false, and then only ever set it to true in
>> some cases. I've tried doing that, per the attached 0002 patch. The
>> regression tests seem to work fine.

I introduced the flag "merged", because I thought it would be verbose
to write "joiners->merged" in many places, but I think removing that
would make the code simpler, so +1 for that change.

> Thanks. I just added a small prologue to compute_partition_bounds().

I tweaked that a bit.

>> I noticed this because I've tried moving part of the function into a
>> separate function, and removing the variable makes that simpler.
>>
>> The patch also does a couple additional minor tweaks.

/*
* Currently, this function is called only from
try_partitionwise_join(),
* so the join type should be INNER, LEFT, FULL, SEMI, or ANTI.
+ *
+ * XXX Maybe an assert would be more appropriate? Or maybe just
+ * bail out by returning NULL? Not sure.
*/
if (jointype != JOIN_INNER && jointype != JOIN_LEFT &&
jointype != JOIN_FULL && jointype != JOIN_SEMI &&
jointype != JOIN_ANTI)

I kept this as originally proposed, but I agree that an assertion
would be more appropriate, because it would save cycles a bit in a
non-assertion-enabled build. I modified this as such. I modified the
test below this as well.

>> 3) I think the for nparts comment is somewhat misleading:
>>
>> int nparts; /* number of partitions; 0 = not partitioned;
>> * -1 = not yet set */
>>
>> which says that nparts=0 means not partitioned. But then there are
>> conditions like this:
>>
>> /* Nothing to do, if the join relation is not partitioned. */
>> if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>> return;
>>
>> which (ignoring the obsolete comment) seems to say we can have nparts==0
>> even for partitioned tables, no?

Yeah, I think I was a bit hasty. I fixed this.

Attached is the original patch (0001) and one patch (0002) with
changes including those by Tomas and Ashutosh. In the 0002 patch I
fixed my many typos as well. :-( Many of them were found by Justin
Pryzby. One of them was found by Ashutosh. Thank you!

>> Anyway, attached is the original patch (0001) and two patches with
>> proposed changes. 0002 removes the "merged" flag as explained in (2),
>> 0003 splits the try_partitionwise_join() function into two parts.

Thanks for the patches, Tomas!

> I have added 0005 with the changes I described in this as well as the previous mail. 0004 is just some white space fixes.

Thanks for the patches, Ashutosh! I fixed the white space issues.

Best regards,
Etsuro Fujita

Attachment Content-Type Size
0001-Improve-partition-matching-for-partitionwise-join.patch application/octet-stream 227.2 KB
0002-Changes.patch application/octet-stream 23.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2020-04-03 15:45:17 Re: Should we add xid_current() or a int8->xid cast?
Previous Message Petr Jelinek 2020-04-03 15:04:23 Re: adding partitioned tables to publications