Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-12-27 05:31:36
Message-ID: CAFjFpRe66z+w9+dnAkWGiaB1CU2CUQsLGsqzHzYBoA=KJFf+PQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

PFA patch rebased after partitioning code was committed.

On Thu, Dec 1, 2016 at 4:32 PM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> Hi Robert,
> Sorry for delayed response.
>
> The attached patch implements following ideas:
> 1. At the time of creating paths - If the joining relations are both
> partitioned and join can use partition-wise join, we create paths for
> few child-joins. Similar to inheritance relations
> (set_append_rel_pathlist()), we collect paths with similar properties
> from all sampled child-joins and create one PartitionJoinPath with
> each set of paths. The cost of the PartitionJoinPath is obtained by
> multiplying the sum of costs of paths in the given set by the ratio of
> (number of rows estimated in the parent-join/sum of rows in
> child-joins).
>
> 2. If the PartitionJoinPath emerges as the best path, we create paths
> for each of the remaining child-joins. Then we collect paths with
> properties same as the given PartitionJoinPath, one from each
> child-join. These paths are converted into plans and a Merge/Append
> plan is created combing these plans. The paths and plans for
> child-join are created in a temporary memory context. The final plan
> for each child-join is copied into planner's context and the temporary
> memory context is reset.
>
> Right now, we choose 1% or 1 (whichever is higher) child-joins to base
> PartitionJoinPath costs on.
>
> Memory consumption
> -----------------------------
> I tested a 5-way self-join for a table with 1000 partitions, each
> partition having 1M rows. The memory consumed in standard_planner()
> was measured with some granular tracking
> (mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
> consumed total of 289MB memory which is approx 6.6 times more than
> non-partition-wise join which consumed 44MB. That's much better than
> the earlier 16 times consumption for 5-way join with 100 partitions.
>
> The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
> SpecialJoinInfos for child-joins (64MB), restrictlist translation
> (92MB), paths for sampled child-joins (1.5MB), building targetlists
> for child-joins (7MB).
>

In the earlier implementation, a given clause which was applicable to
multiple join orders was getting translated as many times as the join
orders it was applicable in. I changed RestrictInfo for parent to
store a list of RestrictInfos applicable to children to avoid multiple
translations.

My earlier patch created the child-join plans in a temporary context
and then copied them into planner context since the translated clauses
were allocated memory in temporary memory context then. Now that they
are stored in planner's context, we can directly create the plan in
the planner's context.

Third, I added code to free up child SpecialJoinInfos after using those.

As a result the total memory consumption now is 192MB, which is approx
4.4 times the memory consumed during planning in case of
non-partition-wise join.

>
> Choosing representative child-joins:
> --------------------------------------------------
> There's another angle to choosing representative child joins. In a
> partitioned N-way join, different joins covering different subsets of
> N relations, will have different size distributions across the
> partitions. This means that the child-joins costed for (N-k) joins,
> may be different for those required for (N-k+1) joins. With a factor
> of 1% sampling, N is such that a child-join participates in 100 joins,
> we will end up creating paths for all partitions before creating
> PartitionJoinPaths for the final N-way join. Hopefully that will be a
> rare case and usually we will end up using paths already created. We
> can not avoid creating PartitionJoinPaths for subset joins, as there
> might be cases when partition-wise join will be optimal for an N-k way
> join but not for N-way join. We may avoid this if we choose
> representative child-joins based on their positions, in which case, we
> may end up with some or all of those being empty and thus skewing the
> costs heavily.
>
> Partial paths
> -----------------
> AFAIU, we create partial paths for append relation, when all the
> children have partial paths. Unlike parameterized paths or path with
> pathkeys, there is no way to create a partial path for a normal path.
> This means that unless we create paths for all child-joins, we can not
> create partial paths for appendrel comprising of child-joins, and thus
> can not use parallel query right now. This may not be that bad, since
> it would be more efficient to run each child-join in a separate
> worker, rather than using multiple workers for a single child-join.

This still applies.

>
> regression tests
> ----------------------
> I observed that for small relations (1000 rows in each partition and
> 100 partitions), the size estimates in append relations and sum of
> those in child relations are very different. As a result, the
> extrapolated costs for PartitionJoinPaths as described above, are way
> higher than costs of join of appends (or even append of joins if we
> are to create paths for all child-joins). Thus with this approach, we
> choose partition-wise join for large number of partitions with large
> data (e.g. 1000 partitions with 1M rows each). These are certainly the
> cases when partition-wise join is a big win. I have not tried to find
> out a threshold above which partition-wise join gets chosen with above
> approach, but it's going to be a larger threshold. That makes writing
> regression tests difficult, as those will require large data. So, we
> have to find a way so that we can test partition-wise join with
> smaller data. There are few possibilities like 1. convert the fraction
> of representative child-joins into GUC and setting it to 100% would
> start choosing partition-wise joins for tables with a few hundred rows
> per partition, like it did in earlier approach, 2. provide a way to
> force partition-wise join whenever possible, by say costing
> partition-wise joins much lesser than non-partition-wise join when a
> GUC is set (e.g. enable_partition_wise_join with values always, never,
> optimal or something like that).
>

For now I have added a float GUC partition_wise_plan_weight. The
partition-wise join cost derived from the samples is multiplied by
this GUC and set as the cost of ParitionJoinPath. A value of 1 means
that the cost derived from the samples are used as is. A value higher
than 1 discourages use of partition-wise join and that lower than 1
encourages use of partition-wise join. I am not very keen on keeping
this GUC, in this form. But we need some way to run regression with
smaller data.

For now I have disabled partition-wise join for multi-level
partitions. I will post a patch soon with that enabled.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
pg_dp_join_v5.patch binary/octet-stream 418.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-12-27 06:58:25 Re: Write Ahead Logging for Hash Indexes
Previous Message Michael Paquier 2016-12-27 05:09:05 Re: Potential data loss of 2PC files