Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Ashutosh Bapat <ashutosh(dot)bapat(at)2ndquadrant(dot)com>
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2020-03-23 13:41:54
Message-ID: CAExHW5staLVrZ_B+ks-S51VPBexQX2rPa7vuaYnj1WdoKhROKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 17, 2020 at 1:44 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com> wrote:
>
> > + /*
> > + * If this segment of the join is empty, it means that this segment
> >
> > "partition of the join" looks consistent with other usages than "segment of the
> > join".
>
> Actually, "segment" is used in the existing comments in the caller
> function try_partitionwise_join(), so I think "segment" is better here
> for consistency.

A segment can be any part of the join relation, not necessarily a
partition. May be we should change the caller.

>
> > + /*
> > + * Get a relids set of partition(s) involved in this join segment that
> > + * are from the rel1 side.
> > + */
> > + child_relids1 = bms_intersect(child_joinrel->relids,
> > + rel1->all_partrels);
> >
> > The partition bounds are sorted by their values. Even for a list partitioned
> > table, the bounds are sorted by the least partition value. We do not allow
> > multiple paritions from one side to be joined with one partition on the other
> > and vice versa. All this together means that the partitions of the join
> > relation are formed by joining partitions from either side in the order of
> > their bounds. This means that the matching pairs of partitions can be found by
> > matching relids of partitions of join with those of the joining relation by
> > traversing partitions from all the three relations only once in the order they
> > appears in partition bounds of corresponding relations.
>
> Consider this 2-way join for list partitioned tables:
>
> CREATE TABLE plt1_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt1_ad_p1 PARTITION OF plt1_ad FOR VALUES IN ('0001', '0003');
> CREATE TABLE plt1_ad_p2 PARTITION OF plt1_ad FOR VALUES IN ('0002');
> INSERT INTO plt1_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
> generate_series(1, 100) i WHERE i % 10 in (1, 2, 3);
> ANALYZE plt1_ad;
> CREATE TABLE plt2_ad (a int, b int, c text) PARTITION BY LIST (c);
> CREATE TABLE plt2_ad_p1 PARTITION OF plt2_ad FOR VALUES IN ('0002', '0004');
> CREATE TABLE plt2_ad_p2 PARTITION OF plt2_ad FOR VALUES IN ('0003');
> INSERT INTO plt2_ad SELECT i, i, to_char(i % 10, 'FM0000') FROM
> generate_series(1, 100) i WHERE i % 10 IN (2, 3, 4);
> ANALYZE plt2_ad;
>
> EXPLAIN (COSTS OFF)
> SELECT t1.c, t1.a, t2.a FROM plt1_ad t1 INNER JOIN plt2_ad t2 ON (t1.c
> = t2.c) WHERE t1.a < 10 ORDER BY t1.c, t1.a, t2.a;
> QUERY PLAN
> -----------------------------------------------------
> Sort
> Sort Key: t1.c, t1.a, t2.a
> -> Append
> -> Hash Join
> Hash Cond: (t2_1.c = t1_2.c)
> -> Seq Scan on plt2_ad_p1 t2_1
> -> Hash
> -> Seq Scan on plt1_ad_p2 t1_2
> Filter: (a < 10)
> -> Hash Join
> Hash Cond: (t2_2.c = t1_1.c)
> -> Seq Scan on plt2_ad_p2 t2_2
> -> Hash
> -> Seq Scan on plt1_ad_p1 t1_1
> Filter: (a < 10)
> (15 rows)
>
> As you can see from the EXPLAIN result, the first partition on the
> outer side matches the second partition on the inner side, and the
> second partition on the outer side matches the first partition on the
> inner side. How does the algorithm you proposed work e.g., when an
> N-way join for list partitioned tables contains this join as its lower
> join?

Hmm, this is a good example. I tried to work out the algorithm based
on the bound ordering. The algorithm worked well when all the bounds
on both the sides were included in the join. But it didn't work well,
when some bounds vanished. In order to detect whether a bound has
vanished, we need to either compare that bound with the bounds of join
(an operation costlier than comparing bitmapset) or compare relids of
all the partitions of the join. Either way it looks costlier than what
you have right now. May be we could improve by keeping track of such
lost bounds and corresponding partitions. But I didn't get time to
work on that part. Anyway, even if such an algorithm exists, we will
have to change just a single function. That could be done later I
think. So we are good here right now. Thanks.

> > + if (rel1_is_simple)
> >
> > This variable is used only in one place. So instead we should the expression
> > assigning the value to it. Changed in the attached patch.
>
> I don't think that's a good idea, because this check is done
> repeatedly in a for loop.

Compiler's optimizer would anyway optimize it away. But anyway, I
won't insist on this.

>
> > - rel->nparts = 0;
> > + rel->nparts = -1;
> >
> > I think we need to add comments around various values that nparts can take. How
> > about like something attached.
>
> +1
>
> > + case PARTITION_STRATEGY_HASH:
> > + merged_bounds = NULL;
> >
> > I think, we need to explain why we aren't merging hash partition bounds. AFAIU,
> > the reason is thus: When the modulo of used for partition mapping (i.e. maximum
> > number of has partitions) is same, their partition bounds are same and do not
> > need merging.
>
> I don't think that's always true; there are cases where the moduli are
> the same, but the partition bounds are not, because it's possible to
> only define partitions for some of the remainders. See the discussion
> in [1].

Hmm, but that case would be rare, IMO. It's an artifact of the way our
hash partitioning syntax is, and does not reflect a real world
scenario.

>
> > If the maximum number of partitions is different for both the
> > joining relations, there's high probability that one partition on one side will
> > join with multiple partitions on the other side. So exact partition bounds
> > match will work in most of the cases. The cases otherwise are not so common to
> > spend the effort in coding and planning.
> >
> > I have added this explanation in the patch.
>
> I also think it would be great if we can perform generic partitionwise
> join for hash partitioned tables, so I'd like to propose to add
> something like this, instead: "Currently we support partitionwise join
> for hash partitioned tables only when the partition bounds for them
> exactly match, but later it might be worth the effort to relax the
> restriction."

That's good too. But please include an explanation about the case when
modulo/max no. of partitions itself differs. That case is not likely
to get addressed in nearer future.

--
Best Wishes,
Ashutosh Bapat

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Sergei Kornilov 2020-03-23 13:46:52 Re: replay pause vs. standby promotion
Previous Message Laurenz Albe 2020-03-23 13:27:29 Re: Berserk Autovacuum (let's save next Mandrill)