Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-05-18 08:38:40
Message-ID: CAFjFpReKnTXPkX68B+vdGC8F0eJ50TYqOzUte02=G8KF5x3A0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> There's a relevant comment in 0006, build_joinrel_partition_info()
>> (probably that name needs to change, but I will do that once we have
>> settled on design)
>> + /*
>> + * Construct partition keys for the join.
>> + *
>> + * An INNER join between two partitioned relations is partition by key
>> + * expressions from both the relations. For tables A and B
>> partitioned by a and b
>> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a
>> + * and B.b.
>> + *
>> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with
>> + * B.b NULL. These rows may not fit the partitioning conditions imposed on
>> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b.
>> + * Strictly speaking, partition keys of an OUTER join should include
>> + * partition key expressions from the OUTER side only. Consider a join like
>> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not
>> + * include B.b as partition key expression for (AB), it prohibits us from
>> + * using partition-wise join when joining (AB) with C as there is no
>> + * equi-join between partition keys of joining relations. But two NULL
>> + * values are never equal and no two rows from mis-matching partitions can
>> + * join. Hence it's safe to include B.b as partition key expression for
>> + * (AB), even though rows in (AB) are not strictly partitioned by B.b.
>> + */
>>
>> I think that also needs to be reviewed carefully.
>
> The following passage from src/backend/optimizer/README seems highly relevant:
>
> ===
> The planner's treatment of outer join reordering is based on the following
> identities:
>
> 1. (A leftjoin B on (Pab)) innerjoin C on (Pac)
> = (A innerjoin C on (Pac)) leftjoin B on (Pab)
>
> where Pac is a predicate referencing A and C, etc (in this case, clearly
> Pac cannot reference B, or the transformation is nonsensical).
>
> 2. (A leftjoin B on (Pab)) leftjoin C on (Pac)
> = (A leftjoin C on (Pac)) leftjoin B on (Pab)
>
> 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc)
> = A leftjoin (B leftjoin C on (Pbc)) on (Pab)
>
> Identity 3 only holds if predicate Pbc must fail for all-null B rows
> (that is, Pbc is strict for at least one column of B). If Pbc is not
> strict, the first form might produce some rows with nonnull C columns
> where the second form would make those entries null.
> ===
>
> In other words, I think your statement that null is never equal to
> null is a bit imprecise. Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator). The
> argument needs to be made in terms of the formal properties of the
> operator.

[.. some portion clipped .. ]

> The relevant logic is in have_partkey_equi_join:
>
> + /* Skip clauses which are not equality conditions. */
> + if (rinfo->hashjoinoperator == InvalidOid &&
> !rinfo->mergeopfamilies)
> + continue;
>
> Actually, I think the hashjoinoperator test is formally and
> practically unnecessary here; lower down there is a test to see
> whether the partitioning scheme's operator family is a member of
> rinfo->mergeopfamilies, which will certainly fail if we got through
> this test with rinfo->mergeopfamilies == NIL just on the strength of
> rinfo->hashjoinoperator != InvalidOid. So you can just bail out if
> rinfo->mergeopfamilies == NIL. But the underlying point here is that
> the only thing you really know about the function is that it's got to
> be a strategy-3 operator in some btree opclass; if that guarantees
> strictness, then so be it -- but I wasn't able to find anything in the
> code or documentation off-hand that supports that contention, so we
> might need to think a bit more about why (or if) this is guaranteed to
> be true.
>
>> Partition-wise joins
>> may be happy including partition keys from all sides, but
>> partition-wise aggregates may not be, esp. when pushing complete
>> aggregation down to partitions. In that case, rows with NULL partition
>> key, which falls on nullable side of join, will be spread across
>> multiple partitions. Proabably, we should separate nullable and
>> non-nullable partition key expressions.
>
> I don't think I understand quite what you're getting at here. Can you
> spell this out in more detail? To push an aggregate down to
> partitions, you need the grouping key to match the applicable
> partition key, and the partition key shouldn't allow nulls in more
> than one place. Now I think your point may be that outer join
> semantics could let them creep in there, e.g. SELECT b.x, sum(a.y)
> FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a
> good test case for partitionwise aggregate. I'd be inclined to think
> that we should just give up on partitionwise aggregate in such cases;
> it's not worth trying to optimize such a weird query, at least IMHO.
> (Does this sort of case ever happen with joins? I think not, as long
> as the join operator is strict.)
>

I am revisiting NULL equality in the context of merging partition
bounds. In [1] paragraphs following

--
Do not write expression = NULL because NULL is not “equal to” NULL.
(The null value represents an unknown value, and it is not known
whether two unknown values are equal.)

--

seem to indicate that an equality operator should never return true
for two NULL values since it would never know whether two NULL
(unknown) values are same or not. In a paragraph above, Robert stated
that

> In other words, I think your statement that null is never equal to
> null is a bit imprecise. Somebody could certainly create an operator
> that is named "=" which returns true in that case, and then they could
> say, hey, two nulls are equal (when you use that operator). The
> argument needs to be made in terms of the formal properties of the
> operator.

But in case a user has written an = operator which returns true for
two NULL values, per description in [1], that comparison operator is
flawed and
using that operator is going to result in SQL-standard-incompliant
behaviour. I have tried to preserve all the relevant portions of
discussion in this mail. Am I missing something?

[1] https://www.postgresql.org/docs/devel/static/functions-comparison.html

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Konstantin Knizhnik 2017-05-18 08:57:57 Re: Cached plans and statement generalization
Previous Message David Rowley 2017-05-18 08:28:56 Regression in join selectivity estimations when using foreign keys