Re: Partition-wise join for join between (declaratively) partitioned tables

From: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-11-04 10:52:06
Message-ID: CAFjFpRfQRvv8+bHyQbEDXN=CXJDbasHfRmVS5mZZLC4wCaz7bQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 31, 2016 at 6:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
> <ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
>> I think there are going to be two kinds of partitioning use-cases.
>> First, partitions carefully hand-crafted by DBAs so that every
>> partition is different from the others, and so is every join between
>> two partitions. There will be a smaller number of partitions, but
>> creating paths for each join between partitions will be crucial from a
>> performance point of view. Consider, for example, systems which use
>> partitions to consolidate results from different sources for
>> analytical purposes or sharding. If we consider the various points you
>> have listed in [1] as to why a partition is equivalent to a table,
>> each join between partitions is going to have very different
>> characteristics and thus deserves a set of paths of its own. Add to
>> that the possibility of partition pruning, or of certain conditions
>> affecting particular partitions, and the need for detailed planning is
>> evident.
>>
>> The other usage of partitioning is to distribute the data and/or
>> quickly eliminate data by partition pruning. In that case, all
>> partitions of a given table will have very similar properties. There
>> is a good chance that we will end up with the same plan for every
>> partition and for every join between partitions. In such cases, I
>> think it suffices to create paths for just one or maybe a handful of
>> partitions of the join and repeat that plan for the other partitions
>> of the join. But in such cases it also makes sense to have a
>> light-weight representation for partitions, as opposed to partitions
>> being full-fledged tables. If we have such a light-weight
>> representation, we may not even need to create RelOptInfos
>> representing joins between partitions, or separate paths for each such
>> join.
>
> I'm not sure I see a real distinction between these two use cases. I
> think that the problem of differing data distribution between
> partitions is almost always going to be an issue. Take the simple
> case of an "orders" table which is partitioned by month. First, the
> month that's currently in progress may be much smaller than a typical
> completed month. Second, many businesses are seasonal and may have
> many more orders at certain times of year. In American retail, for
> example, many businesses have large spikes in December; some may do
> four times as much business in December as in any other month. So you
> will have that sort of variation, at least.
>
>> A typical join tree will be composite: some portion partitioned and
>> some portion unpartitioned, or different portions partitioned by
>> different partitioning schemes. In such a case, inaccurate costs for
>> PartitionJoinPath can affect the plan heavily, causing a suboptimal
>> path to be picked. Assuming that partitioning will be useful for large
>> data sets, choosing a suboptimal plan can be more dangerous than
>> consuming memory for creating paths.
>
> Well, sure. But, I mean, every simplifying assumption which the
> planner makes to limit resource consumption could have that effect.
> join_collapse_limit, for example, can cause horrible plans. However,
> we have it anyway, because the alternative of having planning take far
> too long is unpalatable. Planning is always, at some level,
> guesswork.

My point is that this behaviour is configurable. Users who are ready to
spend time and resources to get the best plan are still able to do so
by choosing a higher join_collapse_limit; those who cannot afford that
willingly accept inferior plans by setting join_collapse_limit to a
lower value.

>
>> A possible solution would be to keep track of the used paths with a
>> reference count. Once the paths for a given join tree are created,
>> free up the unused paths by traversing the pathlist in each of the
>> RelOptInfos. The attached patch has a prototype implementation of
>> this. There are some paths which are not linked to RelOptInfos and
>> need slightly different treatment, but they can be handled too.
>
> So, if you apply this with your previous patch, how much does it cut
> down memory consumption?

I have answered this below.

>
>>> In that way, peak memory usage only grows by
>>> about a factor of 2 rather than a factor equal to the partition count,
>>> because we don't need to keep every possibly-useful path for every
>>> partition all at the same time, but rather every possibly-useful path
>>> for a single partition.
>>>
>>> Maybe there are other ideas but I have a feeling any way you slice it
>>> this is going to be a lot of work.
>>
>> For the case of carefully hand-crafted partitions, I think users
>> would expect the planner to find the genuinely best plan, and thus may
>> be willing to accommodate increased memory usage. Any approach that
>> does not create the paths for joins between partitions is not
>> guaranteed to give the best plan. Users willing to provide increased
>> memory will be unhappy if we do not give them the best path.
>>
>> A user who creates hundreds of partitions will ideally be using a
>> pretty powerful server with a lot of memory. On such servers, the
>> linear increase in memory for paths may not be as bad as you are
>> portraying above, as long as it produces the best plan.
>
> No, I don't agree. We should be trying to build something that scales
> well. I've heard reports of customers with hundreds or even thousands
> of partitions; I think it is quite reasonable to think that we need to
> scale to 1000 partitions. If we use 3MB of memory to plan a query
> involving unpartitioned tables, using 3GB to plan a query where the
> main tables have been partitioned 1000 ways does not seem reasonable
> to me.

Here are the memory consumption numbers.

They are for the simple query "select * from v5_prt100", where
v5_prt100 is a view over a 5-way self-join of table prt100, a plain
table with 100 partitions and no indexes.
postgres=# \d+ v5_prt100
View "part_mem_usage.v5_prt100"
Column | Type | Modifiers | Storage | Description
--------+--------+-----------+----------+-------------
t1 | prt100 | | extended |
t2 | prt100 | | extended |
t3 | prt100 | | extended |
t4 | prt100 | | extended |
t5 | prt100 | | extended |
View definition:
SELECT t1.*::prt100 AS t1,
t2.*::prt100 AS t2,
t3.*::prt100 AS t3,
t4.*::prt100 AS t4,
t5.*::prt100 AS t5
FROM prt100 t1,
prt100 t2,
prt100 t3,
prt100 t4,
prt100 t5
WHERE t1.a = t2.a AND t2.a = t3.a AND t3.a = t4.a AND t4.a = t5.a;

postgres=# \d prt100
Table "part_mem_usage.prt100"
Column | Type | Modifiers
--------+-------------------+-----------
a | integer |
b | integer |
c | character varying |
Partition Key: RANGE (a)
Number of partitions: 100 (Use \d+ to list them.)

Without partition-wise join, standard_planner() consumes 4311 kB of
memory, of which 150 kB is consumed in add_paths_to_joinrel().

With partition-wise join, standard_planner() consumes 65MB of memory,
which is 16 times more (not 100 times more, as you suspected above). Of
this, 16MB is consumed creating child-join paths and 651kB creating
append paths; that is roughly a 100-fold increase for path creation
alone. The rest of the increase breaks down as 9MB to create child-join
RelOptInfos, 29MB to translate restrict clauses, 8MB to translate
target lists, 2MB to create special join info for children, and 2MB to
create plans.

If we apply the logic to free unused paths, the memory consumption
reduces as follows.

Without partition-wise join, standard_planner() consumes 4268 kB
(against 4311 kB earlier), of which 123 kB (against 150 kB earlier) is
consumed in add_paths_to_joinrel().

With partition-wise join, standard_planner() consumes 63MB (against
65MB earlier). Child-join paths still consume 13MB (against 16MB
earlier), which is still about 100 times the figure without
partition-wise join. We may shave off some memory consumption by using
better methods than translating expressions, but we will continue to
have the bloat introduced by the paths, the RelOptInfos for child
joins, etc.

So, I am thinking about your approach of creating PartitionJoinPaths
without actually creating child paths, and then actually planning the
child joins at a later stage. Here's a rough sketch of how that might
be done.

At the time of creating regular paths, we identify the join orders
which can use partition-wise join and save those in the RelOptInfo of
the parent relation. If no such join order exists, we do not create
PartitionJoinPaths for that relation. Otherwise, once we have
considered all the join orders, i.e. in
generate_partition_wise_join_paths(), we create one PartitionJoinPath
for every path that has survived in the parent, or at least for every
path with distinct properties such as pathkeys or parameterisation,
inheriting those properties.
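
To make this concrete, here is a minimal, purely hypothetical sketch of
the path-generation side. PartitionJoinPath does not exist, and
has_partition_wise_join_order() and create_partition_join_path() are
made-up helper names used only for illustration; PlannerInfo,
RelOptInfo, add_path() and the List routines are the existing planner
infrastructure.

/*
 * Hypothetical sketch: add one PartitionJoinPath per surviving parent
 * path, without creating any child-join paths yet.
 */
static void
generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *joinrel)
{
    List       *parent_paths;
    ListCell   *lc;

    /* Skip relations for which no partition-wise join order was found. */
    if (!has_partition_wise_join_order(joinrel))        /* hypothetical */
        return;

    /*
     * Iterate over a copy, since add_path() can prune entries from
     * joinrel->pathlist while we are walking it.
     */
    parent_paths = list_copy(joinrel->pathlist);

    foreach(lc, parent_paths)
    {
        Path   *parent_path = (Path *) lfirst(lc);

        /*
         * One PartitionJoinPath per surviving parent path (or per
         * distinct pathkeys/parameterization), with those properties.
         */
        add_path(joinrel, (Path *)
                 create_partition_join_path(root, joinrel,
                                            parent_path)); /* hypothetical */
    }

    list_free(parent_paths);
}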

At the time of creating the plan, if a PartitionJoinPath is chosen, we
actually create paths for every partition of that relation,
recursively. The path creation logic is carried out in a separate
memory context. Amongst the paths that survive, we choose the best path
that has the same properties as the PartitionJoinPath. We would expect
all parameterized paths to be retained, and any unparameterized path
can be sorted to match the pathkeys of the reference PartitionJoinPath.
We then create the plan from this path, copy it into the outer memory
context and release the memory context used for path creation. This is
similar to how prepared statements save their plans. Once we have the
plan, the memory consumed by the paths is no longer referenced and
hence cannot cause problems. At the end we create an Append/MergeAppend
plan with all the child plans and return it.
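
A rough sketch of that plan-time step, under the same caveat that
PartitionJoinPath and the plan_child_join(), choose_matching_path() and
make_partition_append() helpers are invented for illustration; only the
memory-context routines, create_plan(), copyObject() and lappend() are
existing APIs:

/*
 * Hypothetical sketch: plan each child join in a throwaway memory
 * context, keep only a copy of the resulting plan, and discard all the
 * path memory.
 */
static Plan *
create_partition_join_plan(PlannerInfo *root, PartitionJoinPath *pjpath)
{
    List   *child_plans = NIL;
    int     i;

    for (i = 0; i < pjpath->nparts; i++)
    {
        MemoryContext pathcxt;
        MemoryContext oldcxt;
        RelOptInfo   *child_rel;
        Path         *child_path;
        Plan         *child_plan;

        /* All path and RelOptInfo allocations for this child go here. */
        pathcxt = AllocSetContextCreate(CurrentMemoryContext,
                                        "child join planning",
                                        ALLOCSET_DEFAULT_SIZES);
        oldcxt = MemoryContextSwitchTo(pathcxt);

        /* Recursively create paths for this child join (hypothetical). */
        child_rel = plan_child_join(root, pjpath, i);

        /*
         * Pick the surviving path matching the reference path's
         * pathkeys/parameterization (hypothetical).
         */
        child_path = choose_matching_path(child_rel->pathlist, pjpath);

        /* Simplified: real code would need more care than create_plan(). */
        child_plan = create_plan(root, child_path);

        MemoryContextSwitchTo(oldcxt);

        /* Copy the plan out, then free all the path memory in one go. */
        child_plans = lappend(child_plans, copyObject(child_plan));
        MemoryContextDelete(pathcxt);
    }

    /* Combine the child plans into an Append/MergeAppend (hypothetical). */
    return make_partition_append(pjpath, child_plans);
}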

Costing a PartitionJoinPath needs more thought so that we don't end up
with bad overall plans. Here's an idea. Partition-wise joins are better
than unpartitioned ones because of the smaller sizes of the partitions.
If we think of a join as an O(MN) operation, where M and N are the
sizes of the unpartitioned tables being joined, a partition-wise join
computes P joins, each of average order O(M/P * N/P), where P is the
number of partitions; the total is still O(MN), with the constant
factor reduced P times. I think we need to apply similar logic to
costing. Let's say the cost of a join is J(M, N) = S(M, N) + R(M, N),
where S and R are the setup cost and the joining cost (for M and N
rows) respectively. The cost of the partition-wise join would then be
P * J(M/P, N/P) = P * S(M/P, N/P) + P * R(M/P, N/P). Each of the join
methods has different S and R functions, which may not be linear in the
number of rows. So PartitionJoinPath costs are obtained from the
corresponding regular path costs subjected to the above transformation.
This way, we will be protected from choosing a PartitionJoinPath when
it's not optimal. Take the example of a join where the joining
relations are very small, so that a hash join on the full relations is
cheaper than a hash join of each pair of partitions because of the
setup cost. In such a case, the function which calculates the cost of
setting up the hash table would yield almost the same cost for the full
table as for each of the partitions, thus making P * S(M/P, N/P) much
larger than S(M, N).
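
A toy, self-contained illustration of that last point (the cost model
and numbers are made up, not PostgreSQL's actual hash join costing;
they only show how a mostly-fixed setup term makes the partition-wise
total lose):

/*
 * Toy cost model: cost = setup + per-tuple work, with a setup term
 * dominated by a fixed overhead.  Not PostgreSQL code.
 */
#include <stdio.h>

static double
hash_join_cost(double outer_rows, double inner_rows)
{
    double setup = 100.0 + 0.01 * inner_rows;        /* mostly fixed */
    double run   = 0.02 * (outer_rows + inner_rows); /* scan + probe */

    return setup + run;
}

int
main(void)
{
    double  M = 10000.0;
    double  N = 10000.0;
    int     P = 100;

    double  whole = hash_join_cost(M, N);
    double  partitioned = P * hash_join_cost(M / P, N / P);

    /* Prints roughly 600 vs 10500: the P repeated setups dominate. */
    printf("whole join: %.1f, partition-wise: %.1f\n", whole, partitioned);
    return 0;
}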

Let me know your comments.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
