On partitioning

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: On partitioning
Date: 2014-08-29 15:56:07
Message-ID: 20140829155607.GF7705@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja
reference Tom's post
which mentions the possibility of a different partitioning
implementation than what we have so far. As it turns out, I've been
thinking about partitioning recently, so I thought I would share what
I'm thinking so that others can poke holes. My intention is to try to
implement this as soon as possible.

Declarative partitioning

In this design, partitions are first-class objects, not normal tables in
inheritance hierarchies. There are no pg_inherits entries involved at all.

Partitions are a physical implementation detail. Therefore we do not allow
the owner to be changed, or permissions to be granted directly to partitions;
all these operations happen to the parent relation instead.

System Catalogs

In pg_class we have two additional relkind values:

* relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation.
It is used to indicate a parent table, i.e. one the user can directly
address in DML queries. Such relations DO NOT have their own storage.
These use the same rules as regular tables for access privileges,
ownership and so on.

* relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned
relation (its parent). These cannot be addressed directly in DML
queries and only limited DDL support is provided. They don't have
their own pg_attribute entries either and therefore they are always
identical in column definitions to the parent relation. Since they
are not accessible directly, there is no need for ACL considerations;
the parent relation's owner is the owner, and grants are applied to
the parent relation only.
XXX --- is there a need for a partition having different column
default values than its parent relation?

Partitions are numbered sequentially, normally from 1 onwards; but it is
valid to have negative partition numbers and 0. Partitions don't have
names (except automatically generated ones for pg_class.relname, but
they are unusable in DDL).

Each partition is assigned an Expression that receives a tuple and
returns boolean. This expression returns true if a given tuple belongs
into it, false otherwise. If a tuple for a partitioned relation is run
through expressions of all partitions, exactly one should return true.
If none returns true, it might be because the partition has not been
created yet. A user-facing error is raised in this case (Rationale: if
user creates a partitioned rel and there is no partition that accepts
some given tuple, it's the user's fault.)

Additionally, each partitioned relation may have a master expression.
This receives a tuple and returns an integer, which corresponds to the
number of the partition it belongs into.

There are two new system catalogs:

pg_partitioned_rel --> (prelrelid, prelexpr) pg_partition -->
(partrelid, partseq, partexpr, partoverflow)

For partitioned rels that have prelexpr, we run that expression and
obtain the partition number; as a crosscheck we run partexpr and ensure
it returns true. For partitioned rels that don't have prelexpr, we run
partexpr for each partition in turn until one returns true. This means
that for a properly set up partitioned table, we need to run a single
expression on a tuple to find out what partition the tuple belongs into.

Per-partition expressions are formed as each partition is created, and
are based on the user-supplied partitioning criterion. Master
expressions are formed at relation creation time. (XXX Can we change
the master expression later, as a result of some ALTER command?
Presumably this would mean that all partitions might need to be

Triggers --------

(These are user-defined triggers, not partitioning triggers. In fact
there are no partitioning triggers at all.)

Triggers are attached to the parent relation, not to the specific
partition. When a trigger function runs on a tuple inserted, updated or
modified on a partition, the data received by the trigger function makes
it appear that the tuple belongs to the parent relation. There is no
need to let the trigger know which partition the tuple went in or came
from. XXX is there a need to give it the partition number that the
tuple went it?

Syntax ------

CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr ) This creates the
main table only: no partitions are created automatically.

We do not support other types of partitioning at this stage. We will
implement these later.

We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a
table after the fact). We leave this as a future improvement.

PARTITION <n> This creates a new partition * ALTER TABLE <xyz> CREATE
PARTITION FOR <value> Same as above; the partition number is determined

Allowed actions on a RELKIND_PARTITION:

<xyz> * VACUUM parent PARTITION <n>

As a future extension we will allow partitions to become detached from
the parent relation, thus becoming an independent table. This might be
a relatively expensive operation: pg_attribute entries need to be
created, for example.

Overflow Partitions -------------------

There is no explicit concept of overflow partitions.

Vacuum, aging -------------

PARTITIONED_RELs, not containing tuples directly, do not have
relfrozenxid or relminmxid. Each partition has individual values for
these variables.

Autovacuum knows to ignore PARTITIONED_RELs, and considers each

Each partition is vacuumed as a normal relation.

Planner -------

A partitioned relation behaves just like a regular relation for purposes
of planner. XXX do we need special considerations regarding relation
size estimation?

For scan plans, we need to prepare Append lists which are used to scan
for tuples in a partitioned relation. We can setup fake constraint
expressions based on the partitioning expressions, which let the planner
discard unnecessary partitions by way of constraint exclusion.

(In the future we might be interested in creating specialized plan and
execution nodes that know more about partitioned relations, to avoid
creating useless Append trees only to prune them later.)

Executor --------

When doing an INSERT or UPDATE ResultRelInfo needs to be expanded for
partitioned relations: the target relation of an insertion is the parent
relation, but the actual partition needs to be resolved at ModifyTable
execution time. This means RelOptInfo needs to know about partitions;
either we deal with them as "other rels" terms, or we create a new
RelOptKind. At any rate, running the partitioning expression on the new
tuple would give an partition index. This needs to be done once for
each new tuple.

I think during ExecInsert, after running triggers and before executing
constraints, we need to switch resultRelationDesc from the parent
relation into the partition-specific relation.

ExecInsertIndexTuples only knows about partitions. It's an error to
call it using a partitioned rel.

Heap Access Method ------------------ For the purposes of low-level
routines in heapam.c, only partitions exist; trying to insert or modify
tuples in a RELKIND_PARTITIONED_REL is an error. heap_insert and
heap_multi_insert only accept inserting tuples into an individual
partition. These routines do not check that the tuples belong into the
specific partition; that's responsibility of higher-level code. Because
of this, code like COPY will need to make its own checks. Maybe we
should offer another API (in between high-level things such as
ModifyTable/COPY and heapam.c) that receives tuples into a
PARTITIONED_REL and routes them into specific partitions. Note: need to
ensure we do not slow down COPY for the regular case of

Taking backups --------------

pg_dump is able to dump a partitioned relation as a CREATE
commands. The data of all partitions is considered a single COPY

XXX this limits the ability to restore in parallel. To fix we might consider
using one COPY for each partition. It's not clear what relation should be
mentioned in such a COPY command, though -- my instinct is that it
should reference the parent table only, not the individual partition.

Previous Discussion

Auto Partitioning Patch - WIP version 1
(Nikhil Sontakke, March 2007)

Declarative partitioning grammar
(Gavin Sherry, January 2008)

Patch for automating partitions in PostgreSQL 8.4 Beta 2
(Kedar Potdar, Jun 2009)

Syntax for partitioning
(Itagaki Takahiro, Oct 2009)

Partitioning syntax
(Itagaki Takahiro, Jan 2010)

Not really related:
Dynamic Partitioning using Segment Visibility Maps
(Simon Riggs, January 2008)

Still To Be Designed
* Dependency issues
* Are indexes/constraints inherited from the parent rel?
* Multiple keys? Subpartitioning? Hash partitioning?

Open Questions

* What's the syntax to refer to specific partitions within a partitioned
We could do "TABLE <xyz> PARTITION <n>", but for example if in
the future we add hash partitioning, we might need some non-integer
addressing (OTOH assigning sequential numbers to hash partitions doesn't
seem so bad). Discussing with users of other DBMSs partitioning feature,
one useful phrase is "TABLE <xyz> PARTITION FOR <value>".

* Do we want to provide partitioned materialized views?

Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-08-29 16:35:50 Re: On partitioning
Previous Message Greg Stark 2014-08-29 15:27:07 Re: Why data of timestamptz does not store value of timezone passed to it?