Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-04-21 06:41:04
Message-ID: CAFjFpRdSB0oyrtFDxOXBqd2TfY39c0f-t=b4ghij+w_pJ9ZLnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 21, 2017 at 1:34 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> You seem to be suggesting that we keep as many sets of
>> partition bounds as there are base relations participating in the join
>> and then use appropriate partition bounds based on the columns in the
>> join conditions, so that we can use the same operator as used in the
>> join condition. That doesn't seem to be a good option since the
>> partition bounds will all have same values, only differing in their
>> binary representation because of differences in data types.
>
> Well, actually, I think it is a good option, as I wrote in
> http://postgr.es/m/CA+TgmoY-LiJ+_S7OijNU_r2y=dhSj539WTqA7CaYJ-hcEcCdZg@mail.gmail.com

I guess, you are now confusing between partition bounds for a join
relation and partition bounds of base relation. Above paragraph is
about partition bounds of a join relation. I have already agreed that
we need to store partition bounds in RelOptInfo. For base relation
this is trivial; its RelOptInfo has to store partition bounds as
stored in the partition descriptor of corresponding partitioned table.
I am talking about partition bounds of a join relation. See below for
more explanation.

>
> In that email, my principal concern was allowing partition-wise join
> to succeed even with slightly different sets of partition boundaries
> on the two sides of the join; in particular, if we've got A with A1 ..
> A10 and B with B1 .. B10 and the DBA adds A11, I don't want
> performance to tank until the DBA gets around to adding B11. Removing
> the partition bounds from the PartitionScheme and storing them
> per-RelOptInfo fixes that problem;

We have an agreement on this.

> the fact that it also solves this
> problem of what happens when we have different data types on the two
> sides looks to me like a second reason to go that way.

I don't see how is that fixed. For a join relation we need to come up
with one set of partition bounds by merging partition bounds of the
joining relation and in order to understand how to interpret the
datums in the partition bounds, we need to associate data types. The
question is which data type we should use if the relations being
joined have different data types associated with their respective
partition bounds.

Or are you saying that we don't need to associate data type with
merged partition bounds? In that case, I don't know how do we compare
the partition bounds of two relations?

In your example, A has partition key of type int8, has bound datums
X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
X11. C has partition key type int2 and bound datums X1 .. X12. The
binary representation of X's is going to differ between A, B and C
although each Xk for A, B and C is equal, wherever exists. Join
between A and B will have merged bound datums X1 .. X10 (and X11
depending upon the join type). In order to match bounds of AB with C,
we need to know the data type of bounds of AB, so that we can choose
appropriate equality operator. The question is what should we choose
as data type of partition bounds of AB, int8 or int4. This is
different from applying join conditions between AB and C, which can
choose the right opfamily operator based on the join conditions.

>
> And there's a third reason, too, which is that the opfamily mechanism
> doesn't currently provide any mechanism for reasoning about which data
> types are "wider" or "narrower" in the way that you want. In general,
> there's not even a reason why such a relationship has to exist;
> consider two data types t1 and t2 with opclasses t1_ops and t2_ops
> that are part of the same opfamily t_ops, and suppose that t1 can
> represent any positive integer and t2 can represent any even integer,
> or in general that each data type can represent some but not all of
> the values that can be represented by the other data type. In such a
> case, neither would be "wider" than the other in the sense that you
> need; you essentially want to find a data type within the opfamily to
> which all values of any of the types involved in the query can be cast
> without error, but there is nothing today which requires such a data
> type to exist, and no way to identify which one it is. In practice,
> for all of the built-in opfamilies that have more than one opclass,
> such a data type always exists but is not always unique -- in
> particular, datetime_ops contains date_ops, timestamptz_ops, and
> timestamp_ops, and either of the latter two is a plausible choice for
> the "widest" data type of the three. But there's no way to figure
> that out from the opfamily or opclass information we have today.
>
> In theory, it would be possible to modify the opfamily machinery so
> that every opfamily designates an optional ordering of types from
> "narrowest" to "widest", such that saying t1 is-narrower-than t2 is a
> guarantee that every value of type t1 can be cast without error to a
> value of type t2. But I think that's a bad plan. It means that every
> opfamily created by either the core code or some extension now needs
> to worry about annotating the opclass with this new information, and
> we have to add to core the SQL syntax and supporting code to make that
> work. If it were implementing a valuable feature which could not
> practically be implemented without extending the opfamily machinery,
> then I guess that's what we'd have to suck it up and incur that
> complexity, but in this case it does not appear necessary. Storing
> the partition bounds per-RelOptInfo makes this problem -- and a few
> others -- go away.

This seems to suggest that we can not come up with merged bounds for
join if the partition key types of joining relations differ.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2017-04-21 06:46:54 Re: identity columns
Previous Message Noah Misch 2017-04-21 06:34:34 Re: Quorum commit for multiple synchronous replication.