Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-10-28 07:09:54
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Tue, Oct 18, 2016 at 9:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>>> Have you tested the effect of this patch on planner memory consumption
>>> with multi-way joins between tables with many partitions? If you
>>> haven't, you probably should. (Testing runtime would be good, too.)
>>> Does it grow linearly? Quadratically? Exponentially? Minor leaks
>>> don't matter, but if we're generating too much garbage we'll have to
>>> make sure it gets cleaned up soon enough to prevent runaway memory
>>> usage.
>> I tried to check memory usage with various combinations of number of
>> partitions and number of relations being joined. For higher number of
>> relations being joined like 10 with 100 partitions, OOM killer kicked
>> in during the planning phase. I am suspecting
>> adjust_partitionrel_attrs() (changed that name to
>> adjust_join_appendrel_attrs() to be in sync with
>> adjust_appendrel_attrs()) to be the culprit. It copies expression
>> trees every time for joining two children. That's an exponentially
>> increasing number as the number of legal joins increases
>> exponentially. I am still investigating this.
> I think the root of this problem is that the existing paths shares a
> lot more substructure than the ones created by the new code. Without
> a partition-wise join, the incremental memory usage for a joinrel
> isn't any different whether the underlying rel is partitioned or not.
> If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
> be pointing to some kind of Scan. But the join itself creates exactly
> the same amount of new stuff regardless of what's underneath it. With
> partitionwise join, that ceases to be true. Every joinrel - and the
> number of those grows exponentially in the number of baserels, IICU -
> needs its own list of paths for every member rel. So if a
> non-partition-wise join created X paths, and there are K partitions, a
> partition-wise join creates X * K paths. That's a lot.
> Although we might be able to save some memory by tightening things up
> here and there - for example, right now the planner isn't real smart
> about recycling paths that are evicted by add_path(), and there's
> probably other wastage as well - I suspect that what this shows is
> that the basic design of this patch is not going to be viable.
> Intuitively, it's often going to be the case that we want the "same
> plan" for every partition-set. That is, if we have A JOIN B ON A.x =
> B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility
> partitioned, then the result should be an Append plan with 100 join
> plans under it, and all 100 of those plans should be basically mirror
> images of each other. Of course, that's not really right in general:
> for example, it could be that A1 is big and A2 is small while B1 is
> small and B2 is big, so that the right plan for (A1 JOIN B1) and for
> (A2 JOIN B2) are totally different from each other. But in many
> practical cases we'll want to end up with a plan of precisely the same
> shape for all children, and the current design ignores this, expending
> both memory and CPU time to compute essentially-equivalent paths
> across all children.

I think there are going to be two kinds of partitioning use-cases.
First, carefully hand-crafted by DBAs so that every partition is
different from other and so is every join between two partitions.
There will be lesser number of partitions, but creating paths for each
join between partitions will be crucial from performance point of
view. Consider, for example, systems which use partitions to
consolidate results from different sources for analytical purposes or
sharding. If we consider various points you have listed in [1] as to
why a partition is equivalent to a table, each join between partitions
is going to have very different characteristics and thus deserves a
set of paths for its own. Add to that possibility of partition pruning
or certain conditions affecting particular partitions, the need for
detailed planning evident.

The other usage of partitioning is to distribute the data and/or
quickly eliminate the data by partition pruning. In such case, all
partitions of a given table will have very similar properties. There
is a large chance that we will end up having same plans for every
partition and for joins between partitions. In such cases, I think it
suffices to create paths for just one or may be a handful partitions
of join and repeat that plan for other partitions of join. But in such
cases it also makes sense to have a light-weight representation for
partitions as compared to partitions being a full-fledged tables. If
we have such a light-weight representation, we may not even create
RelOptInfos representing joins between partitions, and different paths
for each join between partitions.

> One way of attacking this problem is to gang together partitions which
> are equivalent for planning purposes, as discussed in the paper "Join
> Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
> and Babu. However, it's not exactly clear how to do this: we could
> gang together partitions that have the same index definitions, but the
> sizes of the heaps, the sizes of their indexes, and the row counts
> will vary from one partition to the next, and any of those things
> could cause the plan choice to be different for one partition vs. the
> next. We could try to come up with heuristics for when those things
> are likely to be true. For example, suppose we compute the set of
> partitions such that all joined relations have matching index
> definitions on all tables; then, we take the biggest table in the set
> and consider all tables more than half that size as part of one gang.
> The biggest table becomes the leader and we compute partition-wise
> paths for just that partition; the other members of the gang will
> eventually get a plan that is of the same shape, but we don't actually
> create it that plan until after scan/join planning is concluded.

Section 5 of that paper talks about clustering partitions together for
joining, only when there is 1:m OR n:1 partition matching for join. In
such a case, it clusters all the partitions from one relation that are
all joined with a single partition of the other relation. But I think
your idea to gang up partitions with similar properties may reduce the
number of paths we create but as you have mentioned how to gang them
up is not very clear. There are just too many factors like
availability of the indexes, sizes of tables, size of intermediate
results etc. which make it difficult to identify the properties used
for ganging up. Even after we do that, in the worst case, we will
still end up creating paths for all partitions of all joins, thus
causing increase in paths proportionate to the number of partitions.

In the section 6.3, the paper mentions that the number of paths
retained are linear in the number of child joins per parent join. So,
it's clear that the paper never considered linear increase in the
paths to be a problem or at least a problem that that work had to
solve. Now, it's surprising that their memory usage increased by 7% to
10%. But 1. they might be measuring total memory and not the memory
used by the planner and they experimented with PostgreSQL 8.3.7, which
probably tried much less number of paths than the current optimizer.

> Another idea is to try to reduce peak memory usage by performing
> planning separately for each partition-set. For example, suppose we
> decide to do a partition-wise join of A, B, and C. Initially, this
> gets represented as a PartitionJoinPath tree, like this:
> PartitionJoinPath
> -> AppendPath for A
> -> PartitionJoinPath
> -> AppendPath for B
> -> AppendPath for C
> Because we haven't created individual join paths for the members, this
> doesn't use much memory. Somehow, we come up with a cost for the
> PartitionJoinPath; it probably won't be entirely accurate. Once
> scan/join planning is concluded, if our final path contains a
> PartitionJoinPath, we go back and loop over the partitions.

A typical join tree will be composite: some portion partitioned and
some portion unpartitioned or different portions partitioned by
different partition schemes. In such case, inaccurate costs for
PartitionJoinPath, can affect the plan heavily, causing a suboptimal
path to be picked. Assuming that partitioning will be useful for large
sets of data, choosing a suboptimal plan can be more dangerous than
consuming memory for creating paths.

If we could come up with costs for PartitionJoinPath using some
methods of interpolation, say by sampling few partitions and then
extrapolating their costs for entire PartitionJoinPath, we can use
this method. But unless the partitions have very similar
characteristics or have such characteristics that costs can be guessed
based on the differences between the characteristics, I do not see how
that can happen. For example, while costing a PartitionJoinPath with
pathkeys, the costs will change a lot based on whether underlying
relations have indexes, or which join methods are used, which in turn
is based on properties on the partitions. Same is the case for paths
with parameterization. All such paths are important when a partitioned
join relation joins with other unpartitioned relation or a partitioned
relation with different partitioning scheme.

When each partition of base relation being joined has different
properties, the cost for join between one set of partitions can differ
from join between other set of partitions. Not only that, the costs
for various properties of resultant paths like pathkeys,
parameterization can vary a lot, depending upon the available indexes
and estimates of rows for each join. So, we need to come up with these
cost estimates separately for each join between partitions to come up
with cost of each PartitionJoinPath. If we have to calculate those
costs to create PartitionJoinPath, we better save them in paths rather
than recalculating them in the second round of planning for joins
between partitions.

> For each
> partition, we switch to a new memory context, perform planning, copy
> the best path and its substructure back to the parent context, and
> then reset the context.

This could be rather tricky. It assumes that all the code that creates
paths for joins, should not allocate any memory which is linked to
some object in a context that lives longer than the path creation
context. There is some code like create_join_clause() or
make_canonical_pathkey(), which carefully chooses which memory context
to allocate memory in. But can we ensure it always? postgres_fdw for
example allocates memory for PgFdwRelationInfo in current memory
context and attaches it in RelOptInfo, which should be in the
planner's original context. So, if we create a new memory context for
each partition, fpinfos would be invalidated when those contexts are
released. Not that, we can not enforce some restriction on the memory
usage while planning, it's hard to enforce it and bugs arising from it
may go unnoticed. GEQO planner might have its own problems with this
approach. Third party FDWs will pose a problem.

A possible solution would be to keep the track of used paths using a
reference count. Once the paths for given join tree are created, free
up the unused paths by traversing pathlist in each of the RelOptInfos.
Attached patch has a prototype implementation for the same. There are
some paths which are not linked to RelOptInfos, which need a bit
different treatment, but they can be handled too.

> In that way, peak memory usage only grows by
> about a factor of 2 rather than a factor equal to the partition count,
> because we don't need to keep every possibly-useful path for every
> partition all at the same time, but rather every possibly-useful path
> for a single partition.
> Maybe there are other ideas but I have a feeling any way you slice it
> this is going to be a lot of work.

For the case of carefully hand-crafted partitions, I think, users
would expect the planner to use really the best plan and thus may be
willing to accommodate for increased memory usage. Going by any
approach that does not create the paths for joins between partitions
is not guaranteed to give the best plan. Users willing to provide
increased memory will be unhappy if we do not give them the best path.

The user who creates hundreds of partitions, will ideally be using
pretty powerful servers with a lot of memory. On such servers, the
linear increase in memory for paths may not be as bad as you are
portraying above, as long as its producing the best plan.

Just joining partitioned tables with hundreds of partitions does not
increase the number of paths. Number of paths increases when two
partitioned tables with similar partitioning scheme are joined with
equality condition on partition key. Unless we consider
repartitioning, how many of the joining relations share same
partitioning scheme? Section 8.6 mentions, "no TPC-H query plan,
regardless of the partitioning scheme, contains n-way child joins for
n >= 4. Maximum partitions that the paper mentions is 168 (Table 3).
My VM which has 8GB RAM and 4 cores handled that case pretty well. We
may add logic to free up space used by useless paths post-join to free
up some memory for next stages of query execution.

There will still be users, for whom the increase in the memory usage
is unexpected. Those will need to be educated or for them we might
take heuristic PartitionJoinPath based approach discussed above. But I
don't think that heuristic approach should be the default case. May be
we should supply a GUC which can switch between the approaches.

Some ideas for GUCs are 1. delay_partition_wise_join - when ON uses
the heuristic approach of PartitionJoinPath.
2. A GUC similar to join_collapse_limit may be used to limit the
number of partitioned relations being joined using partition-wise join
technique. A value of 1, indicates enable_partition_wise_join = false.
So, we may replace enable_partition_wise_join withe this GUC.
3. A GUC max_joinable_partitions (open to suggestions for name) may
specify the maximum number of partitions that two relations may have
to be eligible for partition-wise join.

I guess, using these GUCs allows a user handle the trade-off between
getting the best plan and memory usage consciously. I think, users
would like to accept a suboptimal plans consciously than being thrown
a suboptimal plan without choice.


Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachment Content-Type Size
free_unused_paths.patch text/x-patch 15.5 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2016-10-28 07:29:58 Re: Proposal : For Auto-Prewarm.
Previous Message Andres Freund 2016-10-28 06:46:31 Re: Proposal: scan key push down to heap [WIP]