Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-07-19 14:26:11
Message-ID: CAFjFpRdiPBKRfUbOiWBzCVbfmmTj0U-7eGvpBaSOygDMJz0BWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Sorry forgot to mention: this patch applies on top of the v7 patches posted
by Amit Langote on 27th June (
https://www.postgresql.org/message-id/81371428-bb4b-1e33-5ad6-8c5c51b52cb7%40lab.ntt.co.jp
).

On Tue, Jul 19, 2016 at 7:41 PM, Ashutosh Bapat <
ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:

>
>
> On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
>
>>
>> I haven't reviewed this code yet due to being busy with 9.6, but I
>> think this is a very important query planner improvement with the
>> potential for big wins on queries involving large amounts of data.
>>
>> Suppose we have a pair of equi-partitioned tables. Right now, if we
>> choose to perform a hash join, we'll have to build a giant hash table
>> with all of the rows from every inner partition and then probe it for
>> every row in every outer partition. If there are few enough inner
>> rows that the resultant hash table still fits in work_mem, this is
>> somewhat inefficient but not terrible - but if it causes us to have to
>> batch the hash join where we otherwise would not need to do so, then
>> it really sucks. Similarly, if we decide to merge-join each pair of
>> partitions, a partitionwise join may be able to use an internal sort
>> on some or all partitions whereas if we had to deal with all of the
>> data at the same time we'd need an external sort, possibly multi-pass.
>>
>
> Or we might be able to use indexes directly without need of a MergeAppend.
>
>
>> And if we choose a nested loop, say over an inner index-scan, we do
>> O(outer rows) index probes with this optimization but O(outer rows *
>> inner partitions) index probes without it.
>>
>> In addition, parallel query can benefit significantly from this kind
>> of optimization. Tom recently raised the case of an appendrel where
>> every child has a parallel-safe path but not every child has a partial
>> path; currently, we can't go parallel in that case, but it's easy to
>> see that we could handle it by scheduling the appendrel's children
>> across a pool of workers. If we had this optimization, that sort of
>> thing would be much more likely to be useful, because it could create
>> appendrels where each member is an N-way join between equipartitioned
>> tables. That's particularly important right now because of the
>> restriction that a partial path must be driven by a Parallel SeqScan,
>> but even after that restriction is lifted it's easy to imagine that
>> the effective degree of parallelism for a single index scan may be
>> limited - so this kind of thing may significantly increase the number
>> of workers that a given query can use productively.
>>
>
> +1.
>
> The attached patch implements the logic to assess whether two partitioned
> tables can be joined using partition-wise join technique described in my
> last
> mail on this thread.
>
> Two partitioned relations are considered for partition-wise join if
> following
> conditions are met (See build_joinrel_part_info() for details):
> 1. Both the partitions have same number of partitions, with same number of
> partition keys and partitioned by same strategy - range or list.
> 2. They have matching datatypes for partition keys (partkey_types_match())
> 3. For list partitioned relations, they have same lists for each pair of
> partitions, paired by position in which they appear.
> 4. For range partitioned relations, they have same bounds for each pair of
> partitions, paired by their position when ordered in ascending fashion on
> the
> upper bounds.
> 5. There exists an equi-join condition for each pair of partition keys,
> paired
> by the position in which they appear.
>
> Partition-wise join technique can be applied under more lenient
> constraints [1]
> e.g. joins between tables with different number of partitions but having
> same
> bounds/lists for the common partitions. I am planning to defer that to a
> later
> version of this feature.
>
> A join executed using partition-wise join technique is itself a relation
> partitioned by the similar partitioning scheme as the joining relations
> with
> the partition keys combined from the joining relations.
>
> A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo)
> structure
> is used to store the partitioning information for a given base or relation.
> In build_simple_rel(), we construct PartitionOptInfo structure for the
> given
> base relation by copying the relation's PartitionDesc and PartitionKey
> (structures from Amit Langote's patch). While doing so, all the partition
> keys
> are stored as expressions. The structure also holds the RelOptInfos of the
> partition relations. For a join relation, most of the PartitionOptInfo is
> copied from either of the joining relations, except the partition keys and
> RelOptInfo of partition relations. Partition keys of the join relations are
> created by combing partition keys from both the joining relations. The
> logic to
> cosnstruct RelOptInfo for the partition-wise join relations is yet to be
> implemented.
>
> Since the logic to create the paths and RelOptInfos for partition-wise join
> relations is not implemented yet, a query which can use partition-wise join
> fails with error
> "ERROR: the relation was considered for partition-wise join, which is not
> supported right now.". It will also print messages to show which of the
> joins
> can and can not use partition-wise join technique e.g.
> "NOTICE: join between relations (b 1) and (b 2) is considered for
> partition-wise join." The relations are indicated by their relid in the
> query.
> OR
> "NOTICE: join between relations (b 1) and (b 2) is NOT considered for
> partition-wise join.".
> These messages are for debugging only, and will be removed once path
> creation
> logic is implemented.
>
> The patch adds a test partition_join.sql, which has a number of positive
> and
> negative testcases for joins between partitioned tables.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Magnus Hagander 2016-07-19 16:00:32 Re: Updating our timezone code in the back branches
Previous Message Ashutosh Bapat 2016-07-19 14:11:18 Re: Partition-wise join for join between (declaratively) partitioned tables