Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-09-18 12:02:06
Message-ID: CAFjFpRfHLrgni-1+C14Nj1R96dje-rGNorgEs1qvGJtqTM6=vQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Sep 16, 2017 at 2:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Sep 15, 2017 at 6:11 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Thanks a lot Robert.
>>
>> Here are rebased patches.
>
> I didn't get quite as much time to look at these today as I would have
> liked, but here's what I've got so far.
>
> Comments on 0001:
>
> - In the RelOptInfo, part_oids is defined in a completely different
> part of the structure than nparts, but you can't use it without nparts
> because you don't know how long it is. I suggest moving the
> definition to just after nparts.
>
> - On the other hand, maybe we should just remove it completely. I
> don't see it in any of the subsequent patches. If it's used by the
> advanced matching code, let's leave it out of 0001 for now and add it
> back after the basic feature is committed.

No, it's not used by advanced partition matching code. It was used by
to match OIDs with the child rels to order those in the array. But now
that we are expanding in EIBO fashion, it is not useful. Should have
removed it earlier. Removed now.

>
> - Similarly, partsupfunc isn't used by the later patches either. It
> seems it could also be removed, at least for now.

It's used by advanced partition matching code to compare bounds. It
will be required by partition pruning patch. But removed for now.

>
> - The comment for partexprs isn't very clear about how the lists
> inside the array work. My understanding is that the lists have as
> many members as the partition key has columns/expressions.

Actually we are doing some preparation for partition-wise join here.
partexprs and nullable_partexprs are used in partition-wise join
implementation patch. I have updated prologue of RelOptInfo structure
with the comments like below

* Note: A base relation will always have only one set of partition keys. But a
* join relation has as many sets of partition keys as the number of joining
* relations. The number of partition keys is given by
* "part_scheme->partnatts". "partexprs" and "nullable_partexprs" are arrays
* containing part_scheme->partnatts elements. Each element of the array
* contains a list of partition key expressions. For a base relation each list
* contains only one expression. For a join relation each list contains at
* most as many expressions as the joining relations. The expressions in a list
* at a given position in the array correspond to the partition key at that
* position. "partexprs" contains partition keys of non-nullable joining
* relations and "nullable_partexprs" contains partition keys of nullable
* joining relations. For a base relation only "partexprs" is populated.

Let me know this looks fine. The logic to match the partition keys of
joining relations in have_partkey_equi_join() and
match_expr_to_partition_keys() becomes simpler if we arrange the
partition key expressions as array indexed by position of partition
key and each array element as list of partition key expressions at
that position.

partition pruning might need partexprs look up relevant quals, but
nullable_partexprs doesn't have any use there. So may be we should add
nullable_partexpr to RelOptInfo as part of 0002 (partition-wise join
implementation) instead of 0001. What do you think?

>
> - I'm not entirely sure whether maintaining partexprs and
> nullable_partexprs is the right design. If I understand correctly,
> whether or not a partexpr is nullable is really a per-RTI property,
> not a per-expression property. You could consider something like
> "Relids nullable_rels".

That's true. However in order to decide whether an expression falls on
nullable side of a join, we will need to call pull_varnos() on it and
check the output against nullable_rels. Separating the expressions
themselves avoids that step.

>
> Comments on 0002:
>
> - The relationship between deciding to set partition scheme and
> related details and the configured value of enable_partition_wise_join
> needs some consideration. If the only use of the partition scheme is
> partition-wise join, there's no point in setting it even for a baserel
> unless enable_partition_wise_join is set -- but if there are other
> important uses for that data, such as Amit's partition pruning work,
> then we might want to always set it. And similarly for a join: if the
> details are only needed in the partition-wise join case, then we only
> need to set them in that case, but if there are other uses, then it's
> different. If it turns out that setting these details for a baserel
> is useful in other cases but that it's only for a joinrel in the
> partition-wise join case, then the patch gets it exactly right. But
> is that correct? I'm not sure.

Partition scheme contains the information about data types of
partition keys, which is required to compare partition bounds.
Partition pruning will need to compare constants with partition bounds
and hence will need information contained in partition scheme. So, we
will need to set it for base relations whether or not partition-wise
join is enabled.

>
> - The naming of enable_partition_wise_join might also need some
> thought. What happens when we also have partition-wise aggregate?
> What about the proposal to strength-reduce MergeAppend to Append --
> would that use this infrastructure? I wonder if we out to call this
> enable_partition_wise or enable_partition_wise_planning to make it a
> bit more general. Then, too, I've never really liked having
> partition_wise in the GUC name because it might make someone think
> that it makes you partitions have a lot of wisdom. Removing the
> underscore might help: partitionwise. Or maybe there is some whole
> different name that would be better. If anyone wants to bikeshed,
> now's the time.

partitions having a lot of wisdom would be wise_partitions rather than
partition_wise ;).

If partition-wise join is disabled, partition-wise aggregates,
strength reduction of MergeAppend won't be possible on a join tree,
but those will be possible on a base relation. Even if partition-wise
join enabled, one may want to disable other partition-wise
optimizations individually. So, they are somewhat independent
switches. I don't think we should bundle all of those into one.
Whatever names we choose for those GUCs, I think they should have same
naming convention e.g. "partition_wise_xyz". I am open to suggestions
about the names.

>
> - It seems to me that build_joinrel_partition_info() could be
> simplified a bit. One thing is that list_copy() is perfectly capable
> of handling a NIL input, so there's no need to test for that before
> calling it.

partexprs may be NULL for FULL JOIN and nullable_partexprs may be NULL
when there is no nullable relation. So, we have to check existence of
those arrays before accessing lists containing partition key
expressions. list_copy() is being called on individual array elements
and "if" conditions check for the existence of array.

The functions might have become complicated because I am using
outer/inner_partexprs to hold one of the lists and partexprs contains
the array of lists. We may use better named, but I don't have any
better ideas right now. Will think about them.

We could simplify that function according to your suggestion of
nullable_relids. Basically partexprs then contains partition key
expressions all relations nullable and non-nullable. nullable_relids +
pull_varnos() tells us which of those fall on nullable side and which
ones don't. Is this how you are thinking of simplifying it? If we go
with this scheme, again nullable_relids will not be useful for
partition pruning, so may be we should add it as part of 0002
(partition-wise join implementation) instead of 0001.

>
> Comments on 0003:
>
> - Instead of reorganizing add_paths_to_append_rel as you did, could
> you just add an RTE_JOIN case to the switch? Not sure if there's some
> problem with that idea, but it seems like it might come out nicer.

RTE_JOIN is created only for joins specified using JOIN clause i.e
syntactic joins. The joins created during query planner like rel1,
rel2, rel3 do not have RTE_JOIN. So, we can't use RTE_JOIN there.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Oliver Ford 2017-09-18 12:39:25 Re: Add Roman numeral conversion to to_number
Previous Message Robert Haas 2017-09-18 11:54:13 Re: Is it time to kill support for very old servers?