Quick Links

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From:	Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
To:	amul sul <sulamul(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date:	2019-11-13 04:16:45
Message-ID:	CAPmGK17nCL1bk2tLgk6umj19Foj91qS0Z3=YfBy2qUCjU6FWTA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi Amul,

On Wed, Nov 6, 2019 at 3:12 PM amul sul <sulamul(at)gmail(dot)com> wrote:
> Attached is the rebase atop the latest master head(a9056cc637f).

Thanks for that!

> On Tue, Nov 5, 2019 at 6:44 PM amul sul <sulamul(at)gmail(dot)com> wrote:
>> On Fri, Nov 1, 2019 at 3:58 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
>>> Done. Attached is a new version of the patch.

>> A query and comments for v25:
>>
>> 583 + * The function returns NULL if we can not find the matching pair of
>> 584 + * partitions. This happens if 1. multiple partitions on one side match with
>> 585 + * one partition on the other side. 2. a given partition on the outer side
>> 586 + * doesn't have a matching partition on the inner side. We can not support the
>> 587 + * first case since we don't have a way to represent multiple partitions as a
>> 588 + * single relation (RelOptInfo) and then perform join using the ganged
>> 589 + * relation. We can not support the second case since the missing inner
>> 590 + * partition needs to be represented as an empty relation and we don't have a
>> 591 + * way to introduce empty relation during join planning after creating paths
>> 592 + * for all the base relations.
>> 593 + */
>> 594 +PartitionBoundInfo
>> 595 +partition_bounds_merge(int partnatts,
>>
>> I think the second condition mention for partition_bounds_merge() looks
>> outdated, do you think we should remove that or am I missing something here?

Yeah, I think so. In addition to that, the output arguments for that
function *outer_pars and *inner_parts don't store partition indexes
anymore, so I updated comments for that function totally. Does that
make sense?

>> 1768 +
>> 1769 + /*
>> 1770 + * If this is an outer join, the merged partition would act as the
>> 1771 + * default partition of the join; record the index in *default_index
>> 1772 + * if not done yet.
>> 1773 + */
>> 1774 + if (jointype == JOIN_LEFT || jointype == JOIN_ANTI ||
>> 1775 + jointype == JOIN_FULL)
>> 1776 + {
>>
>> As decided need to replace this list by IS_OUTER_JOIN(jointype).

Done.

>> 2020 + if (jointype == JOIN_LEFT || jointype == JOIN_FULL ||
>> 2021 + jointype == JOIN_ANTI)
>> 2022 + {
>>
>> Same as the previous.

Done.

>> I tried a coverage testing and tried to adjust & add a few tests to improved the
>> code coverage for the v25 patch. Please have a look at the attached 0002 & also
>> attach the coverage output with & without this patch, TWIMW.

Good idea! I merged that to the main patch after modifying that a bit
so to reduce the outputs of SELECTs.

Attached is an updated version of the main patch. Thanks for reviewing!

I looked at partition_range_bounds_merge() more closely. Here are my comments:

* When merging partition bounds from the outer/inner sides in the
while loop of that function, we only moves to the next partition on
one side when both partitions overlap (ie, overlap=true) and the upper
bounds of partitions are not the same (ie, ub_cmpval<0 or
ub_cmpval>0); but I think that's inefficient. Let me explain using an
example:

create table prt1 (a int, b int) partition by range (a);
create table prt1_p1 partition of prt1 for values from (0) to (50);
create table prt1_p2 partition of prt1 for values from (100) to (150);
create table prt2 (a int, b int) partition by range (a);
create table prt2_p1 partition of prt2 for values from (0) to (100);
create table prt2_p2 partition of prt2 for values from (100) to (200);
insert into prt1 select i, i from generate_series(0, 49) i;
insert into prt1 select i, i from generate_series(100, 149) i;
insert into prt2 select i, i from generate_series(0, 49) i;
insert into prt2 select i, i from generate_series(100, 149) i;
analyze prt1;
analyze prt2;
set enable_partitionwise_join to on;
explain verbose select * from prt1 t1 full join prt2 t2 on (t1.a = t2.a);
QUERY PLAN
--------------------------------------------------------------------------------------
Append (cost=2.12..9.12 rows=100 width=16)
-> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
Output: t1.a, t1.b, t2.a, t2.b
Hash Cond: (t1.a = t2.a)
-> Seq Scan on public.prt1_p1 t1 (cost=0.00..1.50 rows=50 width=8)
Output: t1.a, t1.b
-> Hash (cost=1.50..1.50 rows=50 width=8)
Output: t2.a, t2.b
-> Seq Scan on public.prt2_p1 t2 (cost=0.00..1.50
rows=50 width=8)
Output: t2.a, t2.b
-> Hash Full Join (cost=2.12..4.31 rows=50 width=16)
Output: t1_1.a, t1_1.b, t2_1.a, t2_1.b
Hash Cond: (t1_1.a = t2_1.a)
-> Seq Scan on public.prt1_p2 t1_1 (cost=0.00..1.50 rows=50 width=8)
Output: t1_1.a, t1_1.b
-> Hash (cost=1.50..1.50 rows=50 width=8)
Output: t2_1.a, t2_1.b
-> Seq Scan on public.prt2_p2 t2_1 (cost=0.00..1.50
rows=50 width=8)
Output: t2_1.a, t2_1.b
(19 rows)

For this query, the merging would proceed as follow:

1) In the first iteration of the while loop, we have overlap=true, so
we merge prt1_p1 and prt2_p1 and move to the next partition on *only
the outer side* (prt1_p2) as we have ub_cmpval<0. 2) In the second
iteration, we have overlap=false and ub_cmpval>0, so we perform
process_inner_partition() to prt2_p1 and move to the next partition on
the inner side (prt2_p2). 3) In the third iteration, we have
overlap=true, so we merge prt1_p2 and prt2_p2 and move to the next
partition on *only the outer side* as we have ub_cmpval<0 (but can't
as the outer side is exhausted). 4) In the forth iteration, we have
overlap=false and ub_cmpval>0, so we perform process_inner_partition()
to prt2_p2 and move to the next partition on the inner side (but can't
as the inner side is exhausted). And we are done.

We do this in the process_inner_partition() call in the second and
forth iterations:

/*
* In range partitioning, if the given inner partition is already
* merged (eg, because we found an overlapping range earlier), we know
* where it fits in the join result; nothing to do in that case. Else
* create a new merged partition.
*/
if (inner_map->merged_indexes[inner_index] >= 0)
{
if (strategy == PARTITION_STRATEGY_LIST)
*merged_index = inner_map->merged_indexes[inner_index];
else
{
Assert(strategy == PARTITION_STRATEGY_RANGE);
*merged_index = -1;
}
}

What we do in that function is actually a no-op, so the second and
forth iterations are done only to move to the next partition on the
inner side, which I think is inefficient. To avoid that, why not move
to *the next pair of partitions* in the first and third iterations
like the attached? This needs an extra check to see if we can safely
move to the next pair of partitions in the first and third iterations,
but doesn't need meaningless iterations like the second and fourth
iterations in the example. This change also makes the code in
process_inner_partition() shown above unnecessary, so I removed it
from that function (and process_outer_partition()).

* I don't like this:

+ if ((lb_cmpval < 0 && inner_has_default) ||
+ /* Non-overlapping range on the lower side of outer range. */
+ (lb_cmpval > 0 && outer_has_default) ||
+ /* Non-overlapping range on the lower side of inner range. */
+ (ub_cmpval < 0 && outer_has_default) ||
+ /* Non-overlapping range on the upper side of inner range. */
+ (ub_cmpval > 0 && inner_has_default))
+ /* Non-overlapping range on the upper side of outer range. */
+ return NULL;

because it looks like that eg, the first comment "Non-overlapping
range on the lower side of outer range." explains the second condition
"(lb_cmpval > 0 && outer_has_default)", which isn't intended, I think.
Also, it seems a bit verbose to me to add a comment to each of the
four cases. How about simplifying the comments and moving them to
above the if block like the attached?

* In partition_range_merge_next_lb(), do we really need this bit?

+ /*
+ * The lower bound is lower than the last upper bound, thus does not fit
+ * the bounds created so far and hence can not be merged with the existing
+ * bounds.
+ */
+ if (cmpval < 0)
+ return false;

I think that if we have such a lower bound, we would error out before
calling that function. Also, I think it would be easier to understand
to add the upper bound as well within that function. So I modified
that function that way (and renamed it). I also added a bit of
comments to that function.

* Since this is called many times, I changed it to a macro:

+static int32
+partition_range_bound_cmp(int partnatts, FmgrInfo *partsupfunc,
+ Oid *partcollations, PartitionRangeBound *bound1,
+ PartitionRangeBound *bound2)
+{
+ return partition_rbound_cmp(partnatts, partsupfunc, partcollations,
+ bound1->datums, bound1->kind, bound1->lower,
+ bound2);
+}

* This is just my taste, but how about renaming
partition_range_bound_cmp() and partition_range_cmp() to more common?
or readable? names eg, compare_range_bounds() and
compare_range_partitions(), respectively? Also, how about renaming
partition_range_merge() to eg, get_merged_range_bounds()?

Please find attached the delta patch for that. Any feedback welcome!

Best regards,
Etsuro Fujita

Attachment	Content-Type	Size
v27-0001-Improve-partition-matching-for-partitionwise-join.patch	application/octet-stream	312.7 KB
v27-0002-Modify-partition_range_bounds_merge.patch	application/octet-stream	31.0 KB

In response to

Re: [HACKERS] advanced partition matching algorithm for partition-wise join at 2019-11-06 06:11:47 from amul sul

Responses

Re: [HACKERS] advanced partition matching algorithm for partition-wise join at 2019-11-15 05:20:39 from amul sul

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Kapila	2019-11-13 04:18:16	Re: [HACKERS] Block level parallel vacuum
Previous Message	Smith, Peter	2019-11-13 03:51:05	RE: New SQL counter statistics view (pg_stat_sql)