Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-04-21 14:29:17
Lists: pgsql-hackers

Here's an updated patch set.

These are the same as in the earlier patch set.

These are new patches that refactor the partition bound comparison
functions so that the partition key need not be passed directly. The
changes in the first patch are used in this set; the second patch will
be useful for more generic bound matching.

This is the same as the old set, with minor changes. I have moved it
ahead of the other patches, as we discussed offline.

Partition bounds are no longer part of the partition scheme; they now
appear in RelOptInfo. We are still discussing the data type handling
of partition bounds for a join relation, so I have kept partopcintype
in the partition scheme. There is one change, though. From
ComputePartitionAttrs()->GetDefaultOpClass()/ResolveOpClass(), I
gather that partopcintype, rather than parttypid, is the type used for
comparing partition bounds; when the two differ, they are binary
compatible. So I have saved partopcintype in PartitionScheme instead
of parttypid and parttypmod.

What was earlier the multi-level partition-wise join support patch is
now broken into a set of patches, each going with the corresponding
patch for single-level partition-wise join. The idea is that if we
agree on the changes for multi-level partitioning support and want to
commit them before partition-wise join support, we can squash each
pair into one patch. This also associates the multi-level support
changes with the corresponding changes for single-level support, which
might be easier to review.

No changes in this patch.

The patch adds build_joinrel_partition_bounds() to match the partition
bounds of the relations being joined. This function, called from
try_partition_wise_join(), is also responsible for creating the pairs
of matching partitions. When we come to support partition-wise joins
for an unequal number of partitions, only this function would need to
change; the rest of the code would stay the same.
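To make the division of labor concrete, here is a minimal toy sketch
(not the patch's actual code) of what such a bound-matching function
conceptually does in the equal-bounds case: if both sides have
identical bounds, pair up the children one-to-one, otherwise give up
on partition-wise join. Bounds are modeled as plain integers here; the
real code compares Datums using the partitioning opfamily's comparison
function.

```c
#include <stdbool.h>

/* Hypothetical pairing of one outer child with one inner child. */
typedef struct PartPair
{
    int outer_child;
    int inner_child;
} PartPair;

/*
 * Toy stand-in for bound matching: succeed only when both sides have
 * the same number of partitions and identical bound values, filling
 * pairs[] with the matched child indexes.
 */
static bool
match_partition_bounds(const int *outer_bounds, int nouter,
                       const int *inner_bounds, int ninner,
                       PartPair *pairs, int *npairs)
{
    if (nouter != ninner)
        return false;           /* unequal partition counts: bail out */

    for (int i = 0; i < nouter; i++)
    {
        if (outer_bounds[i] != inner_bounds[i])
            return false;       /* bounds differ: no partition-wise join */
        pairs[i].outer_child = i;
        pairs[i].inner_child = i;
    }
    *npairs = nouter;
    return true;
}
```

Supporting unequal partition counts later would then mean generalizing
only this function to emit one-to-many or partial pairings, leaving
its callers unchanged.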

Multi-level support.

No changes to this patch.

Fixes a crash with multi-level partitioning reported by Rajkumar, by
fixing the set_plan_refs() code for nested ConvertRowtypeExprs
corresponding to multiple levels of partitioning.

No changes to this patch.

Multi-level support changes for 0013

Added the test cases reported by Rajkumar.

On Fri, Apr 21, 2017 at 12:11 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> You seem to be suggesting that we keep as many sets of
>>> partition bounds as there are base relations participating in the join
>>> and then use appropriate partition bounds based on the columns in the
>>> join conditions, so that we can use the same operator as used in the
>>> join condition. That doesn't seem to be a good option since the
>>> partition bounds will all have same values, only differing in their
>>> binary representation because of differences in data types.
>> Well, actually, I think it is a good option, as I wrote in
> I guess you are now confusing the partition bounds of a join
> relation with the partition bounds of a base relation. The above
> paragraph is about the partition bounds of a join relation. I have
> already agreed that we need to store partition bounds in RelOptInfo.
> For a base relation this is trivial; its RelOptInfo has to store the
> partition bounds as stored in the partition descriptor of the
> corresponding partitioned table. Here I am talking about the
> partition bounds of a join relation. See below for more explanation.
>> In that email, my principal concern was allowing partition-wise join
>> to succeed even with slightly different sets of partition boundaries
>> on the two sides of the join; in particular, if we've got A with A1 ..
>> A10 and B with B1 .. B10 and the DBA adds A11, I don't want
>> performance to tank until the DBA gets around to adding B11. Removing
>> the partition bounds from the PartitionScheme and storing them
>> per-RelOptInfo fixes that problem;
> We have an agreement on this.
>> the fact that it also solves this
>> problem of what happens when we have different data types on the two
>> sides looks to me like a second reason to go that way.
> I don't see how that is fixed. For a join relation we need to come
> up with one set of partition bounds by merging the partition bounds
> of the joining relations, and in order to understand how to
> interpret the datums in those partition bounds, we need to associate
> a data type with them. The question is which data type we should use
> if the relations being joined have different data types associated
> with their respective partition bounds.
> Or are you saying that we don't need to associate a data type with
> the merged partition bounds? In that case, I don't know how we would
> compare the partition bounds of two relations.
> In your example, A has a partition key of type int8 and bound datums
> X1 .. X10. B has a partition key of type int4 and bound datums X1 ..
> X11. C has a partition key of type int2 and bound datums X1 .. X12.
> The binary representation of the X's is going to differ between A, B
> and C, although each Xk for A, B and C is equal, wherever it exists.
> A join between A and B will have merged bound datums X1 .. X10 (and
> X11, depending upon the join type). In order to match the bounds of
> AB with C, we need to know the data type of the bounds of AB, so
> that we can choose the appropriate equality operator. The question
> is what we should choose as the data type of the partition bounds of
> AB, int8 or int4. This is different from applying join conditions
> between AB and C, which can choose the right opfamily operator based
> on the join conditions.
>> And there's a third reason, too, which is that the opfamily mechanism
>> doesn't currently provide any mechanism for reasoning about which data
>> types are "wider" or "narrower" in the way that you want. In general,
>> there's not even a reason why such a relationship has to exist;
>> consider two data types t1 and t2 with opclasses t1_ops and t2_ops
>> that are part of the same opfamily t_ops, and suppose that t1 can
>> represent any positive integer and t2 can represent any even integer,
>> or in general that each data type can represent some but not all of
>> the values that can be represented by the other data type. In such a
>> case, neither would be "wider" than the other in the sense that you
>> need; you essentially want to find a data type within the opfamily to
>> which all values of any of the types involved in the query can be cast
>> without error, but there is nothing today which requires such a data
>> type to exist, and no way to identify which one it is. In practice,
>> for all of the built-in opfamilies that have more than one opclass,
>> such a data type always exists but is not always unique -- in
>> particular, datetime_ops contains date_ops, timestamptz_ops, and
>> timestamp_ops, and either of the latter two is a plausible choice for
>> the "widest" data type of the three. But there's no way to figure
>> that out from the opfamily or opclass information we have today.
>> In theory, it would be possible to modify the opfamily machinery so
>> that every opfamily designates an optional ordering of types from
>> "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
>> guarantee that every value of type t1 can be cast without error to a
>> value of type t2. But I think that's a bad plan. It means that every
>> opfamily created by either the core code or some extension now needs
>> to worry about annotating the opclass with this new information, and
>> we have to add to core the SQL syntax and supporting code to make that
>> work. If it were implementing a valuable feature which could not
>> practically be implemented without extending the opfamily machinery,
>> then I guess we'd have to suck it up and incur that
>> complexity, but in this case it does not appear necessary. Storing
>> the partition bounds per-RelOptInfo makes this problem -- and a few
>> others -- go away.
> This seems to suggest that we cannot come up with merged bounds for
> a join if the partition key types of the joining relations differ.
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment: application/zip, 70.5 KB
