Re: Table Partitioning, Part 1

From: Hannu Krosing <hannu(at)skype(dot)net>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, bizgres-general <bizgres-general(at)pgfoundry(dot)org>
Subject: Re: Table Partitioning, Part 1
Date: 2005-05-10 13:31:55
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On E, 2005-05-09 at 23:30 +0100, Simon Riggs wrote:
> A more in-depth consideration of the major design options and trade-offs
> available to us... this is an internals related discussion.
> Comments?
> 1. Embellish inheritance or separate infrastructure?
> Inheritance is a somewhat strange PostgreSQL feature, with a few
> gotchas. The "big one" is the lack of a multi-relation index that can be
> used against all of the inherited relations also.

I think this can be solevd in two ways -

1. create a "merged index" which is similar to a partitioned table,
where inserts go to the index corresponding to underlying table but
selects (including uniqueness checks) are done by merging data from
several indexes. This also solves the problem of fast adding and
deleting the partitions to/from the global index.

2. second is variation of my old suggestion of using the fact that we
have 1G storage tables for implementing "partitioned tables" - just make
a mapping from each 1G (128kpage) range to subtable/offset and then
build the index as usual. This has problems with max partitioned table
size (same as our current max for ordinary tables) and also slow adding
of new partitions - deleting can be done fast by mapping the allocated
range to 'removed'.

If 1. can be implemented reasonably efficiently for large number of
merged indexes then there is no point in investigating 2. .

> The full complexity of multiple inheritance, possible multi-level
> inheritance doesn't seem to solve any problems for us. The main argument
> in favour of improving inheritance as a route to partitioning is that
> many people have already developed applications using it to provide some
> of the desirable aspects of partitioning. The amount of code used to
> provide some of these features is actually quite low, and the code
> itself is already specialised.
> It seems prudent to avoid building on that foundation, even though we
> may decide to use some similar approaches.

I would not mind if we give up multiple inheritance.

OTOH multi-level inheritance (and thus multi-level partitioning) can be
quite useful sometimes. For example I could have old data partitioned by
YEAR, but newer also sub-partitioned by month or even day.

Then I could manually PE all previous year by simply doing my query
against sales_2005 and let the automatic PE worry for doing the right
thing within this year.

Also, the multi-level nature of inheritance should be hidden from
planner/optimiser by moving the sub-sub-sub tables up the hierarchy into
a single flat append-like plan.

At least until/if we start applying check constraints to table and its
all child tables, in which case some of the PE could be done on
intermediate nodes before expansion.

> 2. Individual Relations explicitly in the plan or MultiRelation plan
> nodes? (i.e. is a SeqScan of a Partitioned Table one Node or many
> nodes?)
> Currently, Inheritance produces a plan for a SeqScan that looks like
> this...
> Append
> SeqScan
> SeqScan
> SeqScan

Seems reasonable for a seqscan :)

I have an inheritance-paritioned table of about 50 tables and when I
restrict it on some column that has an index then the plan does index-
scan on bigger tables and seqscan with some old tables which have only a
few thousands of rows,

> ISTM fairly straightforward to produce a similar "static" plan along the
> same lines, using Result nodes to implement Partition Elimination.
> Append
> Result
> SeqScan
> Result
> SeqScan
> Result
> SeqScan

So you mean another node that looks inside seqscans restrictions and
then determines if there is any chance that seqscan will return any
rows ?

Why can't this be done at planning time ? And if it can't be done at
planning time, then how do you determine which plan is cheapest ?

> Some may think that looks a little clunky, especially since the plan
> length would increase as the number of Partitions increased. An
> alternative, would be to have a single Scan node that performs a Multi-
> Partition Scan, which looks like this...
> PartitionSeqScan
> There'd always one node, no matter how many partitions. Again, easy to
> implement, since we just add a new type of Scan access method, or alter
> the existing ones iff we access partitioned tables. We'd be able to use
> physical tlists, and all the complexity is hidden.

Pehaps we could have just one scan type which looks like this


And hide all the complexity within it ? ;p

Greater visibility of internals is good, both for optimiser and for user
trying to understand what is going on (explain analyse).

I don't think many nodes are a problem per se, only when they cause some
non-linear growth in some operations.

> My current thinking is to implement explicitly partition plans.


<OT: philosopical musing>

As a general guideline I think it is (almost) never a good idea to trade
CPU cycles for disk accesses.

The speed difference between CPU and disk seems destined to grow in the
forseeable future and unless some cheap RAM-like storage suddenly
emerges, this does not change.

</OT: philosopical musing>

> But what about join plans? There are Use Cases that depend almost
> entirely upon joins for Partition Elimination rather than direct
> restriction of the PPK. (PPUC2)
> How could a join plan be performed with the explicit plan shown above?
> The only option would seem to be a Sort/Merge, which could be a disaster
> if we only wanted a few rows returned from a very large table.

I think the new bitmap scan code in 8.1 solves a similar problem, only
at a smaller scale (eliminating pages/tuples instead of partitions).

Perhaps something similar can be used here for PE.

> If all partitions in the query had identical indexes on them, then we
> have another option. In that case, each index could be thought to form
> part of a larger index ordered initially on the Partitioning Key (PPK).
> If the first column was actually the PPK, then the set of indexes would
> be exactly equivalent to a Global Index. We can call this a Pseudo
> Global Index.
> The Pseudo Global Index could also be used to enforce uniqueness. If all
> of the composite indexes were defined unique and the index contained the
> PPK as one of its columns, this would work.
> The index enforces
> uniqueness within each partition and the PPK enforces uniqueness across
> partitions because the same PPK value cannot be in two partitions.

But only uniqueness of PPK, not any other columns.

> We can also use the Pseudo Global Index as the object of a multi-index
> scan node, since the PGI would be ordered on the PPK. The plan node
> would use the partition boundary definitions to locate the starting
> partition, then seek within the partition to the correct part of the
> index then scan sideways, skipping to the next partition as required.
> This sounds a great idea, but it is exactly the technique I argued
> against only 4 paragraphs ago for use with SeqScans.

Also this works only for PPK and not other columns.

> My current thinking is that we need a mix of the two approaches, though
> I raise all of this here to gain everybody's input...

I'm afraid that hiding the decisions inside the access methods will make
things harder for the optimiser .

Still there may be cases where smarter access methods make sense as an
additional feture, though I cant come up with an example right now.

> 3. Partitions are relations? Or sub-relations?
> If we ever want to implement explicit plans, as discussed above, ISTM
> that we should consider partitions to be full relations.
> It seems much simpler to have partitions as relations...


> 4. Should all partitions have the same indexes?
> It also seems that we have somewhat opposing Use Cases: PPUC2 requires
> joins to achieve Partition Elimination, PPUC 5.2 requires us to have no
> indexes on older partitions (since they reside on tape).
> Some people will say that having all indexes defined the same eases
> maintenance. Others will say it gives us flexibility.
> I would argue in favour of flexibility now and automation later.


> 5. Constraints or specific Partitioning syntax?
> Partition Elimination relies upon being able to prove at execution time
> that two clauses in a query restriction are always false when taken
> together.

As much as possible of PE should be done at planning time, as only then
can our cost-based optimiser make use of it.

Execution-time PE (smart access methods) is mostly a black box for
optimiser, and thus may often produce pessimal plans.

> Taken generically, extending the concept of Partitioning to the max, we
> might imagine that we could take any number of constraints, then AND
> them together with an arbitrarily complex restriction clause, solve it
> and then eliminate partitions. That seems too much work to solve that
> generically and provably. I aim for an easier subset, with a more
> bounded solution algorithm.

This could be done by converting the "arbitrarily complex restriction
clause" and convert it into a list of "ANDed simpler clauses" (I guess
we already have most code needed for that) and then run over "simple
enough" clauses. If we can prove that any 2 of them contradict then the
partition can be liminated.

There are 2 possibly expensive steps:

1. the conversion to "AND'ed list of simple clauses" (unknown

2. matching each of "simple" clauses in the and list with all others
(should be N+(N-1)+(N-2)+..+(1) ~= 2N) complexity)

fortunately 1. needs to be only once, as this part stays constant for
the query and all constraints bfrom the subtable/partition (CHECK's and
other) are appended to AND list. The appended list does also not need to
be checked among themselves as they are by definition non-contradictory
(unless the partition is empty ;) further decreasing the complexity.

This approach can be made equal to your proposal if the check for
simplicity of clauses returns true only for range and equality
comparisons on PPK.

Then later when the system is made to deal with more complicated pairs
of clauses , the "is simple clause" check will be expanded

Of course the checks themselves can bail out with "don't know" response
any time, so the "is simple clause" is just an optimisation.

For a query with a where clause expanding to 4 part ANDed list and a
subtable with 3 check's we need to do only 4 x 3 = 12 PE comparisons per
partition if we compare each restrictio from where clause with each from

If we only have 1 "simple enough" clause in WHERE and 1 in CHECK then we
need to do 4 "simple enough" check per plan + 3 per partiotion + one
actual PE check per partition.

> If we know which constraints are to be used for Partition Elimination,
> and we make some restrictions on the types of restricion clauses that
> can take part in Partition Elimination, then the task gets easier.
> So, I think this requires explicit syntax, even though I am very
> attracted to minimal syntax implementations.

I don't like explicit syntax. We have objected adding optimiser hints so
far, why should we start now ?

Maybe we could just consider all range and equality checks (IN expands
to a list of OR's and does not qualify here, NOT IN does, as this is a
list of ANDs) for the start and move on later if needed.

> Anyway, that was a bit of a ramble, so thanks for reading this far.

So were my comments :) Thanks for everybody who has read this far !

Hannu Krosing <hannu(at)skype(dot)net>

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2005-05-10 13:43:15 Re: Oracle Style packages on postgres
Previous Message Bruce Momjian 2005-05-10 12:54:07 Re: request for sql3 compliance for the update command