Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-04-05 06:42:27
Message-ID: CAFjFpRfJ0cz1KJdFKZGD8yme_NK4-c2OFfYyBqOATMY60gfeSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Apr 5, 2017 at 8:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Apr 4, 2017 at 10:22 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Yes, I agree. For an inner join, the partition key types need to "shrink"
>> and for outer join they need to be "widened". I don't know if there is a way
>> to know "wider" or "shorter" of two given types. We might have to implement
>> a method to merge partition keys to produce partition key of the join, which
>> may be different from either of the partition keys. So, after-all we may
>> have to abandon the idea of canonical partition scheme. I haven't included
>> this change in the attached set of patches.
>
> I think this is why you need to regard the partitioning scheme as
> something more like an equivalence class - possibly the partitioning
> scheme should actually contain (or be?) an equivalence class. Suppose
> this is the query:
>
> SELECT * FROM i4 INNER JOIN i8 ON i4.x = i8.x;
>
> ...where i4 (x) is an int4 partitioning key and i8 (x) is an int8
> partitioning key. It's meaningless to ask whether the result of the
> join is partitioned by int4 or int8. It's partitioned by the
> equivalence class that contains both i4.x and i8.x. If the result of
> this join where joined to another table on either of those two
> columns, a second partition-wise join would be theoretically possible.
> If you insist on knowing the type of the partitioning scheme, rather
> than just the opfamily, you've boxed yourself into a corner from which
> there's no good escape.

Only inner join conditions have equivalence classes associated with
those. Outer join conditions create single element equivalence
classes. So, we can not associate equivalence classes as they are with
partition scheme. If we could do that, it makes life much easier since
checking whether equi-join between all partition keys exist, is simply
looking up equivalence classes that cover joining relations and find
em_member corresponding to partition keys.

It looks like we should only keep strategy, partnatts, partopfamily
and parttypcoll in PartitionScheme. A partition-wise join between two
relations would be possible if all those match. When matching
partition bounds of joining relations, we should rely on partopfamily
to give us comparison function based on the types of partition keys
being joined. In that context it looks like all the partition bound
comparision functions which accept partition key were not written
keeping this use case in mind. They will need to be rewritten to
accept strategy, partnatts, partopfamily and parttypcoll.

There's a relevant comment in 0006, build_joinrel_partition_info()
(probably that name needs to change, but I will do that once we have
settled on design)
+ /*
+ * Construct partition keys for the join.
+ *
+ * An INNER join between two partitioned relations is partition by key
+ * expressions from both the relations. For tables A and B
partitioned by a and b
+ * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
+ * and B.b.
+ *
+ * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
+ * B.b NULL. These rows may not fit the partitioning conditions imposed on
+ * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
+ * Strictly speaking, partition keys of an OUTER join should include
+ * partition key expressions from the OUTER side only. Consider a join like
+ * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
+ * include B.b as partition key expression for (AB), it prohibits us from
+ * using partition-wise join when joining (AB) with C as there is no
+ * equi-join between partition keys of joining relations. But two NULL
+ * values are never equal and no two rows from mis-matching partitions can
+ * join. Hence it's safe to include B.b as partition key expression for
+ * (AB), even though rows in (AB) are not strictly partitioned by B.b.
+ */

I think that also needs to be reviewed carefully. Partition-wise joins
may be happy including partition keys from all sides, but
partition-wise aggregates may not be, esp. when pushing complete
aggregation down to partitions. In that case, rows with NULL partition
key, which falls on nullable side of join, will be spread across
multiple partitions. Proabably, we should separate nullable and
non-nullable partition key expressions.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tatsuo Ishii 2017-04-05 06:44:15 Re: Statement timeout behavior in extended queries
Previous Message Sven R. Kunze 2017-04-05 06:41:31 Re: multivariate statistics (v25)