Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2017-01-02 12:32:41
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Tue, Dec 27, 2016 at 11:01 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> PFA patch rebased after partitioning code was committed.
> On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Hi Robert,
>> Sorry for delayed response.
>> The attached patch implements following ideas:
>> 1. At the time of creating paths - If the joining relations are both
>> partitioned and join can use partition-wise join, we create paths for
>> few child-joins. Similar to inheritance relations
>> (set_append_rel_pathlist()), we collect paths with similar properties
>> from all sampled child-joins and create one PartitionJoinPath with
>> each set of paths. The cost of the PartitionJoinPath is obtained by
>> multiplying the sum of costs of paths in the given set by the ratio of
>> (number of rows estimated in the parent-join/sum of rows in
>> child-joins).
>> 2. If the PartitionJoinPath emerges as the best path, we create paths
>> for each of the remaining child-joins. Then we collect paths with
>> properties same as the given PartitionJoinPath, one from each
>> child-join. These paths are converted into plans and a Merge/Append
>> plan is created combing these plans. The paths and plans for
>> child-join are created in a temporary memory context. The final plan
>> for each child-join is copied into planner's context and the temporary
>> memory context is reset.
>> Right now, we choose 1% or 1 (whichever is higher) child-joins to base
>> PartitionJoinPath costs on.
>> Memory consumption
>> -----------------------------
>> I tested a 5-way self-join for a table with 1000 partitions, each
>> partition having 1M rows. The memory consumed in standard_planner()
>> was measured with some granular tracking
>> (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
>> consumed total of 289MB memory which is approx 6.6 times more than
>> non-partition-wise join which consumed 44MB. That's much better than
>> the earlier 16 times consumption for 5-way join with 100 partitions.
>> The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
>> SpecialJoinInfos for child-joins (64MB), restrictlist translation
>> (92MB), paths for sampled child-joins (1.5MB), building targetlists
>> for child-joins (7MB).
> In the earlier implementation, a given clause which was applicable to
> multiple join orders was getting translated as many times as the join
> orders it was applicable in. I changed RestrictInfo for parent to
> store a list of RestrictInfos applicable to children to avoid multiple
> translations.
> My earlier patch created the child-join plans in a temporary context
> and then copied them into planner context since the translated clauses
> were allocated memory in temporary memory context then. Now that they
> are stored in planner's context, we can directly create the plan in
> the planner's context.
> Third, I added code to free up child SpecialJoinInfos after using those.
> As a result the total memory consumption now is 192MB, which is approx
> 4.4 times the memory consumed during planning in case of
> non-partition-wise join.
>> Choosing representative child-joins:
>> --------------------------------------------------
>> There's another angle to choosing representative child joins. In a
>> partitioned N-way join, different joins covering different subsets of
>> N relations, will have different size distributions across the
>> partitions. This means that the child-joins costed for (N-k) joins,
>> may be different for those required for (N-k+1) joins. With a factor
>> of 1% sampling, N is such that a child-join participates in 100 joins,
>> we will end up creating paths for all partitions before creating
>> PartitionJoinPaths for the final N-way join. Hopefully that will be a
>> rare case and usually we will end up using paths already created. We
>> can not avoid creating PartitionJoinPaths for subset joins, as there
>> might be cases when partition-wise join will be optimal for an N-k way
>> join but not for N-way join. We may avoid this if we choose
>> representative child-joins based on their positions, in which case, we
>> may end up with some or all of those being empty and thus skewing the
>> costs heavily.
>> Partial paths
>> -----------------
>> AFAIU, we create partial paths for append relation, when all the
>> children have partial paths. Unlike parameterized paths or path with
>> pathkeys, there is no way to create a partial path for a normal path.
>> This means that unless we create paths for all child-joins, we can not
>> create partial paths for appendrel comprising of child-joins, and thus
>> can not use parallel query right now. This may not be that bad, since
>> it would be more efficient to run each child-join in a separate
>> worker, rather than using multiple workers for a single child-join.
> This still applies.
>> regression tests
>> ----------------------
>> I observed that for small relations (1000 rows in each partition and
>> 100 partitions), the size estimates in append relations and sum of
>> those in child relations are very different. As a result, the
>> extrapolated costs for PartitionJoinPaths as described above, are way
>> higher than costs of join of appends (or even append of joins if we
>> are to create paths for all child-joins). Thus with this approach, we
>> choose partition-wise join for large number of partitions with large
>> data (e.g. 1000 partitions with 1M rows each). These are certainly the
>> cases when partition-wise join is a big win. I have not tried to find
>> out a threshold above which partition-wise join gets chosen with above
>> approach, but it's going to be a larger threshold. That makes writing
>> regression tests difficult, as those will require large data. So, we
>> have to find a way so that we can test partition-wise join with
>> smaller data. There are few possibilities like 1. convert the fraction
>> of representative child-joins into GUC and setting it to 100% would
>> start choosing partition-wise joins for tables with a few hundred rows
>> per partition, like it did in earlier approach, 2. provide a way to
>> force partition-wise join whenever possible, by say costing
>> partition-wise joins much lesser than non-partition-wise join when a
>> GUC is set (e.g. enable_partition_wise_join with values always, never,
>> optimal or something like that).
> For now I have added a float GUC partition_wise_plan_weight. The
> partition-wise join cost derived from the samples is multiplied by
> this GUC and set as the cost of ParitionJoinPath. A value of 1 means
> that the cost derived from the samples are used as is. A value higher
> than 1 discourages use of partition-wise join and that lower than 1
> encourages use of partition-wise join. I am not very keen on keeping
> this GUC, in this form. But we need some way to run regression with
> smaller data.
> For now I have disabled partition-wise join for multi-level
> partitions. I will post a patch soon with that enabled.

PFA the patch (pg_dp_join_v6.patch) with some bugs fixed and rebased
on the latest code.

Also, PFA patch to support partition-wise join between multi-level
partitioned tables. I copied the Amit Langote's patch for translating
partition hierarchy into inheritance hierarchy and added code to
support partition-wise join. You had expressed some concerns about
Amit's approach in [1], but that discussion is still open. So, I
haven't merged those changes to partition-wise join patch. We may
continue to work on it as separate patch or I can include it in
partition-wise join main patch.

BTW, INSERT into multi-level partitioned tables is crashing with
latest head. The issue was reported in [2]. Because of that
multi_level_partition_join test crashes in pg_dp_join_v6.patch.
Intestingly the crash vanishes when we apply patch supporting
mult-level partition-wise join.


Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_dp_join_v6.patch application/x-download 416.0 KB
multi_level_partition_join.patch application/x-download 100.1 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-01-02 12:50:50 Re: proposal: session server side variables
Previous Message Simon Riggs 2017-01-02 12:20:46 Re: Replication slot xmin is not reset if HS feedback is turned off while standby is shut down