Re: Partition-wise join for join between (declaratively) partitioned tables

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>
Cc: Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partition-wise join for join between (declaratively) partitioned tables
Date: 2016-10-31 13:07:54
Message-ID: CA+TgmoZu6NeKeCk_iq5cQA3ebwVcEz23OtJVREBXk9W55irgeA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 28, 2016 at 3:09 AM, Ashutosh Bapat
<ashutosh(dot)bapat(at)enterprisedb(dot)com> wrote:
> I think there are going to be two kinds of partitioning use-cases.
> First, partitions carefully hand-crafted by DBAs so that every
> partition is different from the others, and so is every join between
> two partitions. There will be a smaller number of partitions, but
> creating paths for each join between partitions will be crucial from
> a performance point of view. Consider, for example, systems which
> use partitions to consolidate results from different sources for
> analytical purposes, or sharding. If we consider the various points
> you have listed in [1] as to why a partition is equivalent to a
> table, each join between partitions is going to have very different
> characteristics and thus deserves a set of paths of its own. Add to
> that the possibility of partition pruning, or of certain conditions
> affecting particular partitions, and the need for detailed planning
> is evident.
>
> The other use of partitioning is to distribute the data and/or to
> quickly eliminate data by partition pruning. In that case, all
> partitions of a given table will have very similar properties. There
> is a good chance that we will end up with the same plan for every
> partition and for every join between partitions. In such cases, I
> think it suffices to create paths for just one, or maybe a handful,
> of partitions of the join and repeat that plan for the other
> partitions of the join. But in such cases it also makes sense to
> have a light-weight representation for partitions, as compared to
> partitions being full-fledged tables. If we have such a light-weight
> representation, we may not even create RelOptInfos representing
> joins between partitions, nor different paths for each join between
> partitions.

I'm not sure I see a real distinction between these two use cases. I
think that the problem of differing data distribution between
partitions is almost always going to be an issue. Take the simple
case of an "orders" table which is partitioned by month. First, the
month that's currently in progress may be much smaller than a typical
completed month. Second, many businesses are seasonal and may have
many more orders at certain times of year. In American retail, for
example, many businesses see large spikes in December; some may do
four times as much business in December as in any other month. So you
will have that sort of variation, at least.

> A typical join tree will be composite: some portion partitioned and
> some portion unpartitioned or different portions partitioned by
> different partition schemes. In such a case, inaccurate costs for a
> PartitionJoinPath can affect the plan heavily, causing a suboptimal
> path to be picked. Assuming that partitioning will be used with
> large data sets, choosing a suboptimal plan can be more dangerous
> than consuming memory for creating paths.

Well, sure. But, I mean, every simplifying assumption which the
planner makes to limit resource consumption could have that effect.
join_collapse_limit, for example, can cause horrible plans. However,
we have it anyway, because the alternative of having planning take far
too long is unpalatable. Planning is always, at some level,
guesswork.

>> For each
>> partition, we switch to a new memory context, perform planning, copy
>> the best path and its substructure back to the parent context, and
>> then reset the context.
>
> This could be rather tricky. It assumes that none of the code that
> creates paths for joins allocates memory linked to some object in a
> context that lives longer than the path-creation context. There is
> some code, like create_join_clause() or make_canonical_pathkey(),
> which carefully chooses which memory context to allocate memory in.
> But can we always ensure that? postgres_fdw, for example, allocates
> memory for PgFdwRelationInfo in the current memory context and
> attaches it to a RelOptInfo, which should be in the planner's
> original context. So, if we create a new memory context for each
> partition, the fpinfos would be invalidated when those contexts are
> released. It's not that we cannot enforce some restriction on memory
> usage while planning, but it would be hard to enforce, and bugs
> arising from it might go unnoticed. The GEQO planner might have its
> own problems with this approach, and third-party FDWs would pose a
> problem too.

Yep, there are problems. :-)
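
For concreteness, the sort of thing I was imagining looks roughly
like this. It's only a sketch: plan_partition_join() and
copy_path_tree() are invented placeholders, and root, joinrel,
npartitions, and best_paths are assumed to come from the enclosing
function, but the MemoryContext calls are the real API:

MemoryContext scratch;
int         i;

scratch = AllocSetContextCreate(CurrentMemoryContext,
                                "partition join planning",
                                ALLOCSET_DEFAULT_SIZES);

for (i = 0; i < npartitions; i++)
{
    MemoryContext oldcxt;
    Path       *best;

    /* build all the candidate paths in the scratch context */
    oldcxt = MemoryContextSwitchTo(scratch);
    best = plan_partition_join(root, joinrel, i);   /* invented */
    MemoryContextSwitchTo(oldcxt);

    /* deep-copy the winner into the caller's context (invented) */
    best_paths[i] = copy_path_tree(best);

    /* then free every losing path in one shot */
    MemoryContextReset(scratch);
}
MemoryContextDelete(scratch);

And the reset is exactly where the problems you describe bite:
anything that stashes a pointer to scratch-context memory in a
longer-lived structure, the way postgres_fdw does with its fpinfo, is
left dangling afterwards.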

> A possible solution would be to keep track of used paths using a
> reference count. Once the paths for a given join tree are created,
> free up the unused paths by traversing the pathlist in each of the
> RelOptInfos. The attached patch has a prototype implementation of
> this. There are some paths which are not linked to RelOptInfos and
> need slightly different treatment, but they can be handled too.

So, if you apply this with your previous patch, how much does it cut
down memory consumption?
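
Also, just to make sure I understand the idea, it amounts to
something like this, doesn't it? (The refcount field and the
free_unused_paths() walker below are invented for illustration; I
haven't studied the patch closely enough to claim this matches it.)

/*
 * Sketch only: assumes a refcount field has been added to Path and
 * is bumped whenever a parent path or a RelOptInfo holds a reference
 * to it.  After the paths for the join tree are built, walk each
 * pathlist and drop whatever nothing references.
 */
static void
free_unused_paths(RelOptInfo *rel)
{
    List       *keep = NIL;
    ListCell   *lc;

    foreach(lc, rel->pathlist)
    {
        Path       *path = (Path *) lfirst(lc);

        if (path->refcount > 0)     /* invented field */
            keep = lappend(keep, path);
        else
            pfree(path);            /* would also recurse into subpaths */
    }
    rel->pathlist = keep;
}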

>> In that way, peak memory usage only grows by about a factor of 2
>> rather than by a factor equal to the partition count, because we
>> don't need to keep every possibly-useful path for every partition
>> at the same time, but rather only the possibly-useful paths for a
>> single partition.
>>
>> Maybe there are other ideas but I have a feeling any way you slice it
>> this is going to be a lot of work.
>
> For the case of carefully hand-crafted partitions, I think users
> would expect the planner to produce really the best plan and thus
> may be willing to accept the increased memory usage. Any approach
> that does not create paths for the joins between partitions is not
> guaranteed to give the best plan, and users willing to provide the
> increased memory will be unhappy if we do not give them the best
> path.
>
> A user who creates hundreds of partitions will likely be using a
> pretty powerful server with a lot of memory. On such a server, the
> linear increase in memory for paths may not be as bad as you are
> portraying above, as long as it produces the best plan.

No, I don't agree. We should be trying to build something that scales
well. I've heard reports of customers with hundreds or even thousands
of partitions; I think it is quite reasonable to think that we need to
scale to 1000 partitions. If we use 3MB of memory to plan a query
involving unpartitioned tables, using 3GB to plan a query where the main
tables have been partitioned 1000 ways does not seem reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
