|From:||Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>|
|To:||Robert Haas <robertmhaas(at)gmail(dot)com>|
|Subject:||Re: Partition-wise join for join between (declaratively) partitioned tables|
|Views:||Raw Message | Whole Thread | Download mbox|
On Fri, Jul 8, 2016 at 12:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I haven't reviewed this code yet due to being busy with 9.6, but I
> think this is a very important query planner improvement with the
> potential for big wins on queries involving large amounts of data.
> Suppose we have a pair of equi-partitioned tables. Right now, if we
> choose to perform a hash join, we'll have to build a giant hash table
> with all of the rows from every inner partition and then probe it for
> every row in every outer partition. If there are few enough inner
> rows that the resultant hash table still fits in work_mem, this is
> somewhat inefficient but not terrible - but if it causes us to have to
> batch the hash join where we otherwise would not need to do so, then
> it really sucks. Similarly, if we decide to merge-join each pair of
> partitions, a partitionwise join may be able to use an internal sort
> on some or all partitions whereas if we had to deal with all of the
> data at the same time we'd need an external sort, possibly multi-pass.
Or we might be able to use indexes directly without need of a MergeAppend.
> And if we choose a nested loop, say over an inner index-scan, we do
> O(outer rows) index probes with this optimization but O(outer rows *
> inner partitions) index probes without it.
> In addition, parallel query can benefit significantly from this kind
> of optimization. Tom recently raised the case of an appendrel where
> every child has a parallel-safe path but not every child has a partial
> path; currently, we can't go parallel in that case, but it's easy to
> see that we could handle it by scheduling the appendrel's children
> across a pool of workers. If we had this optimization, that sort of
> thing would be much more likely to be useful, because it could create
> appendrels where each member is an N-way join between equipartitioned
> tables. That's particularly important right now because of the
> restriction that a partial path must be driven by a Parallel SeqScan,
> but even after that restriction is lifted it's easy to imagine that
> the effective degree of parallelism for a single index scan may be
> limited - so this kind of thing may significantly increase the number
> of workers that a given query can use productively.
The attached patch implements the logic to assess whether two partitioned
tables can be joined using partition-wise join technique described in my
mail on this thread.
Two partitioned relations are considered for partition-wise join if
conditions are met (See build_joinrel_part_info() for details):
1. Both the partitions have same number of partitions, with same number of
partition keys and partitioned by same strategy - range or list.
2. They have matching datatypes for partition keys (partkey_types_match())
3. For list partitioned relations, they have same lists for each pair of
partitions, paired by position in which they appear.
4. For range partitioned relations, they have same bounds for each pair of
partitions, paired by their position when ordered in ascending fashion on
5. There exists an equi-join condition for each pair of partition keys,
by the position in which they appear.
Partition-wise join technique can be applied under more lenient constraints
e.g. joins between tables with different number of partitions but having
bounds/lists for the common partitions. I am planning to defer that to a
version of this feature.
A join executed using partition-wise join technique is itself a relation
partitioned by the similar partitioning scheme as the joining relations with
the partition keys combined from the joining relations.
A PartitionOptInfo (uses name similar to RelOptInfo or IndexOptInfo)
is used to store the partitioning information for a given base or relation.
In build_simple_rel(), we construct PartitionOptInfo structure for the given
base relation by copying the relation's PartitionDesc and PartitionKey
(structures from Amit Langote's patch). While doing so, all the partition
are stored as expressions. The structure also holds the RelOptInfos of the
partition relations. For a join relation, most of the PartitionOptInfo is
copied from either of the joining relations, except the partition keys and
RelOptInfo of partition relations. Partition keys of the join relations are
created by combing partition keys from both the joining relations. The
cosnstruct RelOptInfo for the partition-wise join relations is yet to be
Since the logic to create the paths and RelOptInfos for partition-wise join
relations is not implemented yet, a query which can use partition-wise join
fails with error
"ERROR: the relation was considered for partition-wise join, which is not
supported right now.". It will also print messages to show which of the
can and can not use partition-wise join technique e.g.
"NOTICE: join between relations (b 1) and (b 2) is considered for
partition-wise join." The relations are indicated by their relid in the
"NOTICE: join between relations (b 1) and (b 2) is NOT considered for
These messages are for debugging only, and will be removed once path
logic is implemented.
The patch adds a test partition_join.sql, which has a number of positive and
negative testcases for joins between partitioned tables.
The Postgres Database Company
|Next Message||Ashutosh Bapat||2016-07-19 14:26:11||Re: Partition-wise join for join between (declaratively) partitioned tables|
|Previous Message||Magnus Hagander||2016-07-19 14:00:05||Re: sslmode=require fallback|