Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: amul sul <sulamul(at)gmail(dot)com>
To: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2019-11-15 05:20:39
Message-ID: CAAJ_b959YcPMvMEUfFC0skGpyNyXvG3NYPicbR9tPQQ10GoW3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Nov 13, 2019 at 9:46 AM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
wrote:

> Hi Amul,
>
> On Wed, Nov 6, 2019 at 3:12 PM amul sul <sulamul(at)gmail(dot)com> wrote:
> > Attached is the rebase atop the latest master head(a9056cc637f).
>
> Thanks for that!
>
> > On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul(at)gmail(dot)com> wrote:
> >> On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
> wrote:
> >>> Done. Attached is a new version of the patch.
>
> >> A query and comments for v25:
> >>
> >> 583 + * The function returns NULL if we can not find the matching pair
> of
> >> 584 + * partitions. This happens if 1. multiple partitions on one side
> match with
> >> 585 + * one partition on the other side. 2. a given partition on the
> outer side
> >> 586 + * doesn't have a matching partition on the inner side. We can
> not support the
> >> 587 + * first case since we don't have a way to represent multiple
> partitions as a
> >> 588 + * single relation (RelOptInfo) and then perform join using the
> ganged
> >> 589 + * relation. We can not support the second case since the missing
> inner
> >> 590 + * partition needs to be represented as an empty relation and we
> don't have a
> >> 591 + * way to introduce empty relation during join planning after
> creating paths
> >> 592 + * for all the base relations.
> >> 593 + */
> >> 594 +PartitionBoundInfo
> >> 595 +partition_bounds_merge(int partnatts,
> >>
> >> I think the second condition mention for partition_bounds_merge() looks
> >> outdated, do you think we should remove that or am I missing something
> here?
>
> Yeah, I think so. In addition to that, the output arguments for that
> function *outer_pars and *inner_parts don't store partition indexes
> anymore, so I updated comments for that function totally. Does that
> make sense?
>
Thanks, that's looks better.

[....]
>
> I looked at partition_range_bounds_merge() more closely. Here are my
> comments:
>
> * When merging partition bounds from the outer/inner sides in the
> while loop of that function, we only moves to the next partition on
> one side when both partitions overlap (ie, overlap=true) and the upper
> bounds of partitions are not the same (ie, ub_cmpval<0 or
> ub_cmpval>0); but I think that's inefficient. Let me explain using an
> example:
>
> create table prt1 (a int, b int) partition by range (a);
> create table prt1_p1 partition of prt1 for values from (0) to (50);
> create table prt1_p2 partition of prt1 for values from (100) to (150);
> create table prt2 (a int, b int) partition by range (a);
> create table prt2_p1 partition of prt2 for values from (0) to (100);
> create table prt2_p2 partition of prt2 for values from (100) to (200);
> insert into prt1 select i, i from generate_series(0, 49) i;
> insert into prt1 select i, i from generate_series(100, 149) i;
> insert into prt2 select i, i from generate_series(0, 49) i;
> insert into prt2 select i, i from generate_series(100, 149) i;
> analyze prt1;
> analyze prt2;
> set enable_partitionwise_join to on;
> explain verbose select * from prt1 t1 full join prt2 t2 on (t1.a = t2.a);
> QUERY PLAN
>
> --------------------------------------------------------------------------------------
> Append (cost=2.12..9.12 rows=100 width=16)
> -> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
> Output: t1.a, t1.b, t2.a, t2.b
> Hash Cond: (t1.a = t2.a)
> -> Seq Scan on public.prt1_p1 t1 (cost=0.00..1.50 rows=50
> width=8)
> Output: t1.a, t1.b
> -> Hash (cost=1.50..1.50 rows=50 width=8)
> Output: t2.a, t2.b
> -> Seq Scan on public.prt2_p1 t2 (cost=0.00..1.50
> rows=50 width=8)
> Output: t2.a, t2.b
> -> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
> Output: t1_1.a, t1_1.b, t2_1.a, t2_1.b
> Hash Cond: (t1_1.a = t2_1.a)
> -> Seq Scan on public.prt1_p2 t1_1 (cost=0.00..1.50 rows=50
> width=8)
> Output: t1_1.a, t1_1.b
> -> Hash (cost=1.50..1.50 rows=50 width=8)
> Output: t2_1.a, t2_1.b
> -> Seq Scan on public.prt2_p2 t2_1 (cost=0.00..1.50
> rows=50 width=8)
> Output: t2_1.a, t2_1.b
> (19 rows)
>
> For this query, the merging would proceed as follow:
>
> 1) In the first iteration of the while loop, we have overlap=true, so
> we merge prt1_p1 and prt2_p1 and move to the next partition on *only
> the outer side* (prt1_p2) as we have ub_cmpval<0. 2) In the second
> iteration, we have overlap=false and ub_cmpval>0, so we perform
> process_inner_partition() to prt2_p1 and move to the next partition on
> the inner side (prt2_p2). 3) In the third iteration, we have
> overlap=true, so we merge prt1_p2 and prt2_p2 and move to the next
> partition on *only the outer side* as we have ub_cmpval<0 (but can't
> as the outer side is exhausted). 4) In the forth iteration, we have
> overlap=false and ub_cmpval>0, so we perform process_inner_partition()
> to prt2_p2 and move to the next partition on the inner side (but can't
> as the inner side is exhausted). And we are done.
>
> We do this in the process_inner_partition() call in the second and
> forth iterations:
>
> /*
> * In range partitioning, if the given inner partition is already
> * merged (eg, because we found an overlapping range earlier), we
> know
> * where it fits in the join result; nothing to do in that case.
> Else
> * create a new merged partition.
> */
> if (inner_map->merged_indexes[inner_index] >= 0)
> {
> if (strategy == PARTITION_STRATEGY_LIST)
> *merged_index = inner_map->merged_indexes[inner_index];
> else
> {
> Assert(strategy == PARTITION_STRATEGY_RANGE);
> *merged_index = -1;
> }
> }
>
> What we do in that function is actually a no-op, so the second and
> forth iterations are done only to move to the next partition on the
> inner side, which I think is inefficient. To avoid that, why not move
> to *the next pair of partitions* in the first and third iterations
> like the attached? This needs an extra check to see if we can safely
> move to the next pair of partitions in the first and third iterations,
> but doesn't need meaningless iterations like the second and fourth
> iterations in the example. This change also makes the code in
> process_inner_partition() shown above unnecessary, so I removed it
> from that function (and process_outer_partition()).
>
> Make sense.

> * I don't like this:
>
> + if ((lb_cmpval < 0 && inner_has_default) ||
> + /* Non-overlapping range on the lower side of outer range.
> */
> + (lb_cmpval > 0 && outer_has_default) ||
> + /* Non-overlapping range on the lower side of inner range.
> */
> + (ub_cmpval < 0 && outer_has_default) ||
> + /* Non-overlapping range on the upper side of inner range.
> */
> + (ub_cmpval > 0 && inner_has_default))
> + /* Non-overlapping range on the upper side of outer range.
> */
> + return NULL;
>
> because it looks like that eg, the first comment "Non-overlapping
> range on the lower side of outer range." explains the second condition
> "(lb_cmpval > 0 && outer_has_default)", which isn't intended, I think.
> Also, it seems a bit verbose to me to add a comment to each of the
> four cases. How about simplifying the comments and moving them to
> above the if block like the attached?
>
> Check.

> * In partition_range_merge_next_lb(), do we really need this bit?
>
> + /*
> + * The lower bound is lower than the last upper bound, thus does not
> fit
> + * the bounds created so far and hence can not be merged with the
> existing
> + * bounds.
> + */
> + if (cmpval < 0)
> + return false;
>
> I think that if we have such a lower bound, we would error out before
> calling that function. Also, I think it would be easier to understand
> to add the upper bound as well within that function. So I modified
> that function that way (and renamed it). I also added a bit of
> comments to that function.
>
> * Since this is called many times, I changed it to a macro:
>
> +static int32
> +partition_range_bound_cmp(int partnatts, FmgrInfo *partsupfunc,
> + Oid *partcollations, PartitionRangeBound *bound1,
> + PartitionRangeBound *bound2)
> +{
> + return partition_rbound_cmp(partnatts, partsupfunc, partcollations,
> + bound1->datums, bound1->kind,
> bound1->lower,
> + bound2);
> +}
>
> Check.

> * This is just my taste, but how about renaming
> partition_range_bound_cmp() and partition_range_cmp() to more common?
> or readable? names eg, compare_range_bounds() and
> compare_range_partitions(), respectively? Also, how about renaming
> partition_range_merge() to eg, get_merged_range_bounds()?
>
> Check.

Please find attached the delta patch for that. Any feedback welcome!
>
>
Thank you Fujita san for the patch & the enhancements. I am fine with your
delta patch. I would like to share some thought for other code:

File: partbounds.c:
3298 static void
3299 get_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
3300 Oid *partcollations, JoinType jointype,
3301 PartitionRangeBound *left_lb,
3302 PartitionRangeBound *left_ub,
3303 PartitionRangeBound *right_lb,
3304 PartitionRangeBound *right_ub,
3305 PartitionRangeBound *merged_lb,
3306 PartitionRangeBound *merged_ub)

Can we rename these argument as inner_* & outer_* like we having for the
functions, like 0003 patch?
---

File: partbounds.c:
3322
3323 case JOIN_INNER:
3324 case JOIN_SEMI:
3325 if (compare_range_bounds(partnatts, partsupfuncs,
partcollations,
3326 left_ub, right_ub) < 0)
3327 *merged_ub = *left_ub;
3328 else
3329 *merged_ub = *right_ub;
3330
3331 if (compare_range_bounds(partnatts, partsupfuncs,
partcollations,
3332 left_lb, right_lb) > 0)
3333 *merged_lb = *left_lb;
3334 else
3335 *merged_lb = *right_lb;
3336 break;
3337

How about reusing ub_cmpval & lb_cmpval instead of calling
compare_range_bounds() inside get_merged_range_bounds(), like 0004 patch?
--

Apart from this, I would like to propose 0005 cleanup patch where I have
rearranged function arguments & code to make sure the arguments & the code
related to lower bound should come first and then the upper bound.

Regards,
Amul

P.S: As usual, reattaching 0001 & 0002 patches.

Attachment Content-Type Size
v27-0002-Modify-partition_range_bounds_merge.patch application/octet-stream 31.0 KB
v27-0003-renage-get_merged_range_bounds-args.patch application/octet-stream 3.4 KB
v27-0004-pass-and-reuse-in-get_merged_range_bounds.patch application/octet-stream 3.1 KB
v27-0005-cleanup-rearrage-arguments-minor-comment.patch application/octet-stream 3.0 KB
v27-0001-Improve-partition-matching-for-partitionwise-join.patch application/octet-stream 305.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rushabh Lathia 2019-11-15 06:44:53 Re: Optimize partial TOAST decompression
Previous Message Thomas Munro 2019-11-15 04:41:12 Re: BUG #16079: Question Regarding the BUG #16064