Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-07-07 18:41:58
Message-ID: CA+TgmoZpOMTb=m-EqC=DNDHQy9NC=TMwytqtgVaz_0u-ZPjnWQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 15, 2016 at 3:25 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Amit Langote is working on supporting declarative partitioning in PostgreSQL
> [1]. I have started working on supporting partition-wise join. This mail
> describes a very high-level design and gives some insight into the performance
> improvements.
>
> An equi-join between two partitioned tables can be broken down into
> pair-wise join between their partitions. This technique is called
> partition-wise join. Partition-wise joins between similarly partitioned
> tables with equi-join condition can be efficient because [2]
> 1. Each provably non-empty partition-wise join is smaller than the join
> between the parent tables. All such joins collectively might be more
> efficient than a single join between the parents.
> 2. Such joins are able to exploit properties of the individual partitions,
> like their indexes, storage, etc.
> 3. An N-way partition-wise join may have different efficient join orders
> compared to the efficient join order between the parent tables.
>
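
To make this concrete, here is a minimal sketch of two similarly partitioned
tables. The table/partition names and the payload column are made up purely
for illustration, and the syntax follows the declarative partitioning
proposal in [1] only approximately, since the exact spelling is still being
worked out on that thread:

    -- R and S are range partitioned on their respective join columns,
    -- with identical partition bounds (the "similarly partitioned" case).
    CREATE TABLE r (a int, payload text) PARTITION BY RANGE (a);
    CREATE TABLE r_p1 PARTITION OF r FOR VALUES FROM (0) TO (100);
    CREATE TABLE r_p2 PARTITION OF r FOR VALUES FROM (100) TO (200);

    CREATE TABLE s (b int, payload text) PARTITION BY RANGE (b);
    CREATE TABLE s_p1 PARTITION OF s FOR VALUES FROM (0) TO (100);
    CREATE TABLE s_p2 PARTITION OF s FOR VALUES FROM (100) TO (200);
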
> A partition-wise join is processed in the following stages [2], [3].
> 1. Applicability testing: This phase checks if the join conditions match the
> partitioning scheme. A partition-wise join is efficient if there is an
> equi-join on the partition keys. E.g. a join between tables R and S,
> partitioned by columns a and b respectively, can be broken down into
> partition-wise joins if there is a join condition R.a = S.b. In other words,
> the
> number of provably non-empty partition-wise joins is O(N) where N is the
> number of partitions.
>
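
With the tables sketched above, the first query below has an equi-join
condition on the partition keys of both sides (r.a = s.b), so it passes the
applicability test; the second joins on a non-partition-key column and would
not:

    -- qualifies: equi-join on the partition keys
    SELECT * FROM r JOIN s ON r.a = s.b;

    -- does not qualify: the join condition is not on the partition keys
    SELECT * FROM r JOIN s ON r.payload = s.payload;
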
> 2. Matching: This phase determines which joins between the partitions of R
> and S can potentially produce tuples in the join and prunes empty joins
> between partitions.
>
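
Continuing the sketch above, matching pairs r_p1 with s_p1 and r_p2 with
s_p2, since their bounds line up exactly. If r additionally had a partition
whose bounds overlap nothing on the s side, say

    -- hypothetical extra partition with no counterpart in s
    CREATE TABLE r_p3 PARTITION OF r FOR VALUES FROM (200) TO (300);

then, for an inner join, the pairing involving r_p3 is provably empty and
can be pruned here.
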
> 3. Clustering: This phase aims at reducing the number of partition-wise
> joins by clubbing together partitions from the joining relations, e.g.
> clubbing multiple partitions of one partitioned relation that all join to a
> single partition of the other.
>
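
As a purely conceptual illustration of clustering (this goes beyond the
one-to-one matching handled by the first cut described below): if r were
instead split into [0,50) and [50,100) while s kept a single [0,100)
partition, the two finer r partitions could be clubbed together and joined
against that one s partition once, roughly equivalent to

    SELECT *
      FROM (SELECT * FROM r_q1 UNION ALL
            SELECT * FROM r_q2) r_both
      JOIN s_q1 ON r_both.a = s_q1.b;

where r_q1, r_q2 and s_q1 are again made-up names for those hypothetical
partitions.
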
> 4. Path/plan creation: This phase creates multiple paths for each
> partition-wise join. It also creates Append path(s) representing the union of
> partition-wise joins.
>
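
For the two-partition example above, the resulting plan would conceptually be
an Append over the per-partition join paths, something like the hand-drawn
shape below (not actual EXPLAIN output; the join method picked for each pair
of partitions can differ):

    Append
      ->  Hash Join   (r_p1.a = s_p1.b)
      ->  Merge Join  (r_p2.a = s_p2.b)
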
> The work here focuses on a subset of use-cases discussed in [2]. It only
> considers partition-wise joins between similarly partitioned tables having
> the same number of partitions with the same properties, thus producing at most
> as many partition-wise joins as there are partitions. It should be possible
> to apply the partition-wise join technique (with some special handling for OUTER
> joins) if both relations have some extra partitions with non-overlapping
> partition conditions, apart from the matching partitions. But I am not
> planning to implement this optimization in the first cut.

I haven't reviewed this code yet due to being busy with 9.6, but I
think this is a very important query planner improvement with the
potential for big wins on queries involving large amounts of data.

Suppose we have a pair of equi-partitioned tables. Right now, if we
choose to perform a hash join, we'll have to build a giant hash table
with all of the rows from every inner partition and then probe it for
every row in every outer partition. If there are few enough inner
rows that the resultant hash table still fits in work_mem, this is
somewhat inefficient but not terrible - but if it causes us to have to
batch the hash join where we otherwise would not need to do so, then
it really sucks. Similarly, if we decide to merge-join each pair of
partitions, a partitionwise join may be able to use an internal sort
on some or all partitions whereas if we had to deal with all of the
data at the same time we'd need an external sort, possibly multi-pass.
And if we choose a nested loop, say over an inner index-scan, we do
O(outer rows) index probes with this optimization but O(outer rows *
inner partitions) index probes without it.
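
To put rough, purely illustrative numbers on that last point: with 10 million
outer rows and 16 inner partitions,

    partition-wise join:  ~10 million probes  (one per outer row, against the
                                               matching inner partition only)
    without it:           ~160 million probes (10 million outer rows x 16
                                               inner-partition indexes)
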

In addition, parallel query can benefit significantly from this kind
of optimization. Tom recently raised the case of an appendrel where
every child has a parallel-safe path but not every child has a partial
path; currently, we can't go parallel in that case, but it's easy to
see that we could handle it by scheduling the appendrel's children
across a pool of workers. If we had this optimization, that sort of
thing would be much more likely to be useful, because it could create
appendrels where each member is an N-way join between equipartitioned
tables. That's particularly important right now because of the
restriction that a partial path must be driven by a Parallel SeqScan,
but even after that restriction is lifted it's easy to imagine that
the effective degree of parallelism for a single index scan may be
limited - so this kind of thing may significantly increase the number
of workers that a given query can use productively.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
