Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-10-18 15:39:17
Message-ID: CA+TgmobkjJKixeJZT3dS5WkuXL72gkgTwQToxykp6YF0HqYiQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 14, 2016 at 12:37 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> Have you tested the effect of this patch on planner memory consumption
>> with multi-way joins between tables with many partitions? If you
>> haven't, you probably should. (Testing runtime would be good, too.)
>> Does it grow linearly? Quadratically? Exponentially? Minor leaks
>> don't matter, but if we're generating too much garbage we'll have to
>> make sure it gets cleaned up soon enough to prevent runaway memory
>> usage.
>
> I tried to check memory usage with various combinations of number of
> partitions and number of relations being joined. For higher number of
> relations being joined like 10 with 100 partitions, OOM killer kicked
> in during the planning phase. I am suspecting
> adjust_partitionrel_attrs() (changed that name to
> adjust_join_appendrel_attrs() to be in sync with
> adjust_appendrel_attrs()) to be the culprit. It copies expression
> trees every time for joining two children. That's an exponentially
> increasing number as the number of legal joins increases
> exponentially. I am still investigating this.

I think the root of this problem is that the existing paths shares a
lot more substructure than the ones created by the new code. Without
a partition-wise join, the incremental memory usage for a joinrel
isn't any different whether the underlying rel is partitioned or not.
If it's partitioned, we'll be pointing to an AppendPath; if not, we'll
be pointing to some kind of Scan. But the join itself creates exactly
the same amount of new stuff regardless of what's underneath it. With
partitionwise join, that ceases to be true. Every joinrel - and the
number of those grows exponentially in the number of baserels, IICU -
needs its own list of paths for every member rel. So if a
non-partition-wise join created X paths, and there are K partitions, a
partition-wise join creates X * K paths. That's a lot.

Although we might be able to save some memory by tightening things up
here and there - for example, right now the planner isn't real smart
about recycling paths that are evicted by add_path(), and there's
probably other wastage as well - I suspect that what this shows is
that the basic design of this patch is not going to be viable.
Intuitively, it's often going to be the case that we want the "same
plan" for every partition-set. That is, if we have A JOIN B ON A.x =
B.x JOIN C ON A.y = B.y, and if A, B, and C are all compatibility
partitioned, then the result should be an Append plan with 100 join
plans under it, and all 100 of those plans should be basically mirror
images of each other. Of course, that's not really right in general:
for example, it could be that A1 is big and A2 is small while B1 is
small and B2 is big, so that the right plan for (A1 JOIN B1) and for
(A2 JOIN B2) are totally different from each other. But in many
practical cases we'll want to end up with a plan of precisely the same
shape for all children, and the current design ignores this, expending
both memory and CPU time to compute essentially-equivalent paths
across all children.

One way of attacking this problem is to gang together partitions which
are equivalent for planning purposes, as discussed in the paper "Join
Optimization Techniques for Partitioned Tables" by Herodotou, Borisov,
and Babu. However, it's not exactly clear how to do this: we could
gang together partitions that have the same index definitions, but the
sizes of the heaps, the sizes of their indexes, and the row counts
will vary from one partition to the next, and any of those things
could cause the plan choice to be different for one partition vs. the
next. We could try to come up with heuristics for when those things
are likely to be true. For example, suppose we compute the set of
partitions such that all joined relations have matching index
definitions on all tables; then, we take the biggest table in the set
and consider all tables more than half that size as part of one gang.
The biggest table becomes the leader and we compute partition-wise
paths for just that partition; the other members of the gang will
eventually get a plan that is of the same shape, but we don't actually
create it that plan until after scan/join planning is concluded.

Another idea is to try to reduce peak memory usage by performing
planning separately for each partition-set. For example, suppose we
decide to do a partition-wise join of A, B, and C. Initially, this
gets represented as a PartitionJoinPath tree, like this:

PartitionJoinPath
-> AppendPath for A
-> PartitionJoinPath
-> AppendPath for B
-> AppendPath for C

Because we haven't created individual join paths for the members, this
doesn't use much memory. Somehow, we come up with a cost for the
PartitionJoinPath; it probably won't be entirely accurate. Once
scan/join planning is concluded, if our final path contains a
PartitionJoinPath, we go back and loop over the partitions. For each
partition, we switch to a new memory context, perform planning, copy
the best path and its substructure back to the parent context, and
then reset the context. In that way, peak memory usage only grows by
about a factor of 2 rather than a factor equal to the partition count,
because we don't need to keep every possibly-useful path for every
partition all at the same time, but rather every possibly-useful path
for a single partition.

Maybe there are other ideas but I have a feeling any way you slice it
this is going to be a lot of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2016-10-18 15:42:52 Re: Add PGDLLEXPORT to PG_FUNCTION_INFO_V1
Previous Message Tom Lane 2016-10-18 14:58:04 Re: Add PGDLLEXPORT to PG_FUNCTION_INFO_V1