Re: [HACKERS] advanced partition matching algorithm for partition-wise join

From: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
To: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, ashutosh(dot)bapat(at)2ndquadrant(dot)com
Subject: Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Date: 2020-03-03 16:48:40
Message-ID: CAExHW5uZAzhNYj9hhgoKj1d3mqJZH6yGqvY=ecBFk9W=7zL9yA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Fujita-san,
I reviewed the patch. Except for the logic of matching the pairs of
partitions from already merged partitions, I think the code changes are
good. But there are several places where it needs further changes to
comments. The attached patch has those. I have described some of them below.
+ * We can not perform partition-wise join if either of the joining
+ * relations is not partitioned.

We are consistently using partitionwise instead of partition-wise.

+ /*
+ * See if the partition bounds for inputs are exactly the same, in
+ * which case we don't need to work hard: partitions with the same
+ * partition indexes will form join pairs, and the join rel will have
+ * the same partition bounds as inputs; otherwise try to merge the
+ * partition bounds along with generating join pairs.

Phrase "joining relations" is better than "inputs", IMO. Updated in the
attached patch.

+ /*
+ * If the partition bounds for the join rel are not merged ones,
+ * inputs are guaranteed to have the same partition bounds, so
+ * partitions with the same partition indexes will form join pairs;
+ * else let get_matching_part_pairs() do the work.
+ */
+ if (joinrel->merged)
+ {

This condition in the comment is opposite to the condition being checked in
code, so looks confusing. BTW this comment is also repeated around line
1405.
See attached patch for correction.

+ /*
+ * If this segment of the join is empty, it means that this segment

"partition of the join" looks consistent with other usages than "segment of
the
join". Modified in the attached patch.

+ /*
+ * Get a relids set of partition(s) involved in this join segment that
+ * are from the rel1 side.
+ */
+ child_relids1 = bms_intersect(child_joinrel->relids,
+ rel1->all_partrels);

The partition bounds are sorted by their values. Even for a list partitioned
table, the bounds are sorted by the least partition value. We do not allow
multiple paritions from one side to be joined with one partition on the
other
and vice versa. All this together means that the partitions of the join
relation are formed by joining partitions from either side in the order of
their bounds. This means that the matching pairs of partitions can be found
by
matching relids of partitions of join with those of the joining relation by
traversing partitions from all the three relations only once in the order
they
appears in partition bounds of corresponding relations. If we use this
algorithm, we don't need all_partrels to be collected and also don't need to
search base or join relation. That, I think, will reduce the time and space
complexity of this logic. Am I missing something? As a side effect it won't
require any special handling for base and join relation.

+ /*
+ * Get a child rel for rel1 with the relids. Note that we should have
+ * the child rel even if rel1 is a join rel, because in that case the
+ * partitions specified in the relids would have matching/overlapping
+ * boundaries, so those partitions should be considered as ones to be
+ * joined even when planning partitionwise joins of rel1, meaning that
+ * the child rel would have been built by the time we get here.
+ */
+ if (rel1_is_simple)

This variable is used only in one place. So instead we should the expression
assigning the value to it. Changed in the attached patch.

- rel->nparts = 0;
+ rel->nparts = -1;

I think we need to add comments around various values that nparts can take.
How
about like something attached.

+ case PARTITION_STRATEGY_HASH:
+ merged_bounds = NULL;

I think, we need to explain why we aren't merging hash partition bounds.
AFAIU,
the reason is thus: When the modulo of used for partition mapping (i.e.
maximum
number of has partitions) is same, their partition bounds are same and do
not
need merging. If the maximum number of partitions is different for both the
joining relations, there's high probability that one partition on one side
will
join with multiple partitions on the other side. So exact partition bounds
match will work in most of the cases. The cases otherwise are not so common
to
spend the effort in coding and planning.

I have added this explanation in the patch. Don't know if it's there written
somewhere already.

+ if (part_index == -1)
+ return -1;
+ } while (is_dummy_partition(rel, part_index));

I understand why we are skipping NULL positions. I am not sure why are we
are
skipping over RelOptInfos which exist but are marked as dummy; we can still
create a join pair with those partitions.

+/*
+ * get_merged_range_bounds
+ * Given the bounds of range partitions to be join, determine the range

s/join/joined/

There are more changes to comments, where I thought that the comments are
required or existing comments need more clarification. Please review the
attached patch. This patch is created on top of
v32-0001-Improve-partition-matching-for-partitionwise-join.

On Mon, Feb 10, 2020 at 5:14 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
wrote:

> On Fri, Feb 7, 2020 at 9:57 PM Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>
> wrote:
> > On Thu, Feb 6, 2020 at 3:55 AM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
> wrote:
> > > The patches apply and pass all tests. A review of the patch vs.
> master looks reasonable.
>
> I've merged the patches. Attached is a new version of the patch.
>
> > > The partition_join.sql test has multiple levels of partitioning, but
> when your patch extends that test with “advanced partition-wise join”, none
> of the tables for the new section have multiple levels. I spent a little
> while reviewing the code and inventing multiple level partitioning tests
> for advanced partition-wise join and did not encounter any problems. I
> don’t care whether you use this particular example, but do you want to have
> multiple level partitioning in the new test section?
> >
> > Yes, I do.
> >
> > > CREATE TABLE alpha (a double precision, b double precision) PARTITION
> BY RANGE (a);
> > > CREATE TABLE alpha_neg PARTITION OF alpha FOR VALUES FROM
> ('-Infinity') TO (0) PARTITION BY RANGE (b);
> > > CREATE TABLE alpha_pos PARTITION OF alpha FOR VALUES FROM (0) TO
> ('Infinity') PARTITION BY RANGE (b);
> > > CREATE TABLE alpha_nan PARTITION OF alpha FOR VALUES FROM ('Infinity')
> TO ('NaN');
> > > CREATE TABLE alpha_neg_neg PARTITION OF alpha_neg FOR VALUES FROM
> ('-Infinity') TO (0);
> > > CREATE TABLE alpha_neg_pos PARTITION OF alpha_neg FOR VALUES FROM (0)
> TO ('Infinity');
> > > CREATE TABLE alpha_neg_nan PARTITION OF alpha_neg FOR VALUES FROM
> ('Infinity') TO ('NaN');
> > > CREATE TABLE alpha_pos_neg PARTITION OF alpha_pos FOR VALUES FROM
> ('-Infinity') TO (0);
> > > CREATE TABLE alpha_pos_pos PARTITION OF alpha_pos FOR VALUES FROM (0)
> TO ('Infinity');
> > > CREATE TABLE alpha_pos_nan PARTITION OF alpha_pos FOR VALUES FROM
> ('Infinity') TO ('NaN');
> > > INSERT INTO alpha (a, b)
> > > (SELECT * FROM
> > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8),
> ('Infinity'::float8)) a,
> > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8),
> ('Infinity'::float8)) b
> > > );
> > > ANALYZE alpha;
> > > ANALYZE alpha_neg;
> > > ANALYZE alpha_pos;
> > > ANALYZE alpha_nan;
> > > ANALYZE alpha_neg_neg;
> > > ANALYZE alpha_neg_pos;
> > > ANALYZE alpha_neg_nan;
> > > ANALYZE alpha_pos_neg;
> > > ANALYZE alpha_pos_pos;
> > > ANALYZE alpha_pos_nan;
> > > CREATE TABLE beta (a double precision, b double precision) PARTITION
> BY RANGE (a, b);
> > > CREATE TABLE beta_lo PARTITION OF beta FOR VALUES FROM (-5, -5) TO (0,
> 0);
> > > CREATE TABLE beta_me PARTITION OF beta FOR VALUES FROM (0, 0) TO (0,
> 5);
> > > CREATE TABLE beta_hi PARTITION OF beta FOR VALUES FROM (0, 5) TO (5,
> 5);
> > > INSERT INTO beta (a, b)
> > > (SELECT * FROM
> > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) a,
> > > (VALUES (-1.0::float8), (0.0::float8), (1.0::float8)) b
> > > );
> > > ANALYZE beta;
> > > ANALYZE beta_lo;
> > > ANALYZE beta_me;
> > > ANALYZE beta_hi;
> > > EXPLAIN SELECT * FROM alpha INNER JOIN beta ON (alpha.a = beta.a AND
> alpha.b = beta.b) WHERE alpha.a = 1 AND beta.b = 1;
> > > QUERY PLAN
> > >
> -------------------------------------------------------------------------------
> > > Nested Loop (cost=0.00..2.11 rows=1 width=32)
> > > -> Seq Scan on alpha_pos_pos alpha (cost=0.00..1.06 rows=1
> width=16)
> > > Filter: ((b = '1'::double precision) AND (a = '1'::double
> precision))
> > > -> Seq Scan on beta_hi beta (cost=0.00..1.04 rows=1 width=16)
> > > Filter: ((b = '1'::double precision) AND (a = '1'::double
> precision))
> > > (5 rows)
> >
> > Hmm, I'm not sure this is a good test case for that, because this
> > result would be due to partition pruning applied to each side of the
> > join before considering partition-wise join; you could get the same
> > result even with enable_partitionwise_join=off. I think it's
> > important that the partition-wise join logic doesn't break this query,
> > though.
>
> I think this would be beyond the scope of the patch, so I added
> different test cases that I think would be better as ones for
> multi-level partitioning.
>
> Thanks!
>
> Best regards,
> Etsuro Fujita
>

--
--
Best Wishes,
Ashutosh Bapat

Attachment Content-Type Size
changes_on_top_of_v32_0001.patch text/x-patch 11.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Wong 2020-03-03 16:54:29 Re: [GSoC 2020] Questions About Performance Farm Benchmarks and Website
Previous Message Jeff Davis 2020-03-03 16:46:24 Re: Add LogicalTapeSetExtend() to logtape.c