Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-12-01 11:02:34
Message-ID: CAFjFpRcZ_M3-JxoiDkdoPS+-9Cok4ux9Si+4drcRL-62af=jWw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Robert,
Sorry for delayed response.

The attached patch implements following ideas:
1. At the time of creating paths - If the joining relations are both
partitioned and join can use partition-wise join, we create paths for
few child-joins. Similar to inheritance relations
(set_append_rel_pathlist()), we collect paths with similar properties
from all sampled child-joins and create one PartitionJoinPath with
each set of paths. The cost of the PartitionJoinPath is obtained by
multiplying the sum of costs of paths in the given set by the ratio of
(number of rows estimated in the parent-join/sum of rows in
child-joins).

2. If the PartitionJoinPath emerges as the best path, we create paths
for each of the remaining child-joins. Then we collect paths with
properties same as the given PartitionJoinPath, one from each
child-join. These paths are converted into plans and a Merge/Append
plan is created combing these plans. The paths and plans for
child-join are created in a temporary memory context. The final plan
for each child-join is copied into planner's context and the temporary
memory context is reset.

Right now, we choose 1% or 1 (whichever is higher) child-joins to base
PartitionJoinPath costs on.

Memory consumption
-----------------------------
I tested a 5-way self-join for a table with 1000 partitions, each
partition having 1M rows. The memory consumed in standard_planner()
was measured with some granular tracking
(mem_usage_func_wise_measurement_slabwise.patch). Partition-wise join
consumed total of 289MB memory which is approx 6.6 times more than
non-partition-wise join which consumed 44MB. That's much better than
the earlier 16 times consumption for 5-way join with 100 partitions.

The extra 245MB memory was consumed by child-join RelOptInfos (48MB),
SpecialJoinInfos for child-joins (64MB), restrictlist translation
(92MB), paths for sampled child-joins (1.5MB), building targetlists
for child-joins (7MB).

In order to choose representative child-joins based on the sizes of
child-joins, we need to create all the child-join RelOptInfos. In
order to estimate sizes of child-joins, we need to create
SpecialJoinInfos and restrictlists for at least one join order for all
child-joins. For every representative child-join, we need to create
SpecialJoinInfo and restrictlist for all join orders for that
child-join. We might be able to save of restrictlist translation, if
we create restrict lists from joininfo similar to parent joins. I
haven't tried that yet.

Choosing representative child-joins:
--------------------------------------------------
There's another angle to choosing representative child joins. In a
partitioned N-way join, different joins covering different subsets of
N relations, will have different size distributions across the
partitions. This means that the child-joins costed for (N-k) joins,
may be different for those required for (N-k+1) joins. With a factor
of 1% sampling, N is such that a child-join participates in 100 joins,
we will end up creating paths for all partitions before creating
PartitionJoinPaths for the final N-way join. Hopefully that will be a
rare case and usually we will end up using paths already created. We
can not avoid creating PartitionJoinPaths for subset joins, as there
might be cases when partition-wise join will be optimal for an N-k way
join but not for N-way join. We may avoid this if we choose
representative child-joins based on their positions, in which case, we
may end up with some or all of those being empty and thus skewing the
costs heavily.

Partial paths
-----------------
AFAIU, we create partial paths for append relation, when all the
children have partial paths. Unlike parameterized paths or path with
pathkeys, there is no way to create a partial path for a normal path.
This means that unless we create paths for all child-joins, we can not
create partial paths for appendrel comprising of child-joins, and thus
can not use parallel query right now. This may not be that bad, since
it would be more efficient to run each child-join in a separate
worker, rather than using multiple workers for a single child-join.

regression tests
----------------------
I observed that for small relations (1000 rows in each partition and
100 partitions), the size estimates in append relations and sum of
those in child relations are very different. As a result, the
extrapolated costs for PartitionJoinPaths as described above, are way
higher than costs of join of appends (or even append of joins if we
are to create paths for all child-joins). Thus with this approach, we
choose partition-wise join for large number of partitions with large
data (e.g. 1000 partitions with 1M rows each). These are certainly the
cases when partition-wise join is a big win. I have not tried to find
out a threshold above which partition-wise join gets chosen with above
approach, but it's going to be a larger threshold. That makes writing
regression tests difficult, as those will require large data. So, we
have to find a way so that we can test partition-wise join with
smaller data. There are few possibilities like 1. convert the fraction
of representative child-joins into GUC and setting it to 100% would
start choosing partition-wise joins for tables with a few hundred rows
per partition, like it did in earlier approach, 2. provide a way to
force partition-wise join whenever possible, by say costing
partition-wise joins much lesser than non-partition-wise join when a
GUC is set (e.g. enable_partition_wise_join with values always, never,
optimal or something like that).

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
mem_usage_func_wise_measurement_slabwise.patch binary/octet-stream 12.3 KB
pg_dp_join_v5.patch binary/octet-stream 457.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Meskes 2016-12-01 11:27:44 Re: Fix typo in ecpg.sgml
Previous Message Mithun Cy 2016-12-01 10:00:54 Re: Patch: Implement failover on libpq connect level.