Quick Links

Declarative partitioning

From:	Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Declarative partitioning
Date:	2015-08-18 10:30:20
Message-ID:	55D3093C.5010800@lab.ntt.co.jp
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

I would like propose $SUBJECT for this development cycle. Attached is a
WIP patch that implements most if not all of what's described below. Some
yet unaddressed parts are mentioned below, too. I'll add this to the CF-SEP.

Syntax
======

1. Creating a partitioned table

CREATE TABLE table_name
PARTITION BY {RANGE|LIST}
ON (column_list);

Where column_list consists of simple column names or expressions:

PARTITION BY LIST ON (name)
PARTITION BY RANGE ON (year, month)

PARTITION BY LIST ON ((lower(left(name, 2)))
PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))

Note: LIST partition key supports only one column.

For each column, you could write operator class name:

PARTITION BY LIST/RANGE ON (colname [USING] opclass_name),

If not specified, the default btree operator class based on type of each
key column is used. If none of the available btree operator classes are
compatible with the partitioning strategy (list/range), error is thrown.
Built-in btree operator classes cover a good number of types for list and
range partitioning in practical scenarios.

A table created using this form is of proposed new relkind
RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is
created to store partition key info.

Note: A table cannot be partitioned after-the-fact using ALTER TABLE.

Normal dependencies are created between the partitioned table and operator
classes, object in partition expressions like functions.

2. Creating a partition of a partitioned table

CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec;

Where values_spec is:

listvalues: [IN] (val1, ...)

rangevalues: START (col1min, ... ) END (col1max, ... )
| START (col1min, ... )
| END (col1max, ... )

A table created using this form has proposed pg_class.relispartition set
to true. An entry in pg_partition (see below) is created to store the
partition bound info.

The values_spec should match the partitioning strategy of the partitioned
table. In case of a range partition, the values in START and/or END should
match columns in the partition key.

Defining a list partition is fairly straightforward - just spell out the
list of comma-separated values. Error is thrown if the list of values
overlaps with one of the existing partitions' list.

CREATE TABLE persons_by_state (name text, state text)
PARTITION BY LIST ON (state);

CREATE TABLE persons_IL
PARTITION OF persons_by_state
FOR VALUES IN ('IL');

CREATE TABLE persons_fail
PARTITION OF persons_by_state
FOR VALUES IN ('IL');
ERROR: cannot create partition that overlaps with an existing one

For a range partition, there are more than one way:

Specify both START and END bounds: resulting range should not overlap with
the range(s) covered by existing partitions. Error is thrown otherwise.
Although rare in practice, gaps between ranges are OK.

CREATE TABLE measurement(logdate date NOT NULL)
PARTITION BY RANGE ON (logdate);

CREATE TABLE measurement_y2006m02
PARTITION OF measurement
FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success

CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES START ('2006-02-15') END ('2006-03-01');
ERROR: cannot create partition that overlaps with an existing one

Specify only the START bound: add the partition on the left of some range
covered by existing partitions provided no overlap occurs (also
considering gaps between ranges, if any). If no such range exists, the new
partition will cover the range [START, +INFINITY) and become the rightmost
partition. Error is thrown if the specified START causes overlap.

CREATE TABLE measurement_y2006m01
PARTITION OF measurement
FOR VALUES START ('2006-01-01'); --success

CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one

Specify only the END bound: add the partition on the right of some range
covered by existing partitions provided no overlap occurs (also
considering gaps between ranges, if any). If no such range exists, the new
partition would cover the range (-INFINITY, END) and become the leftmost
partition. Error is thrown if the specified END causes overlap.

CREATE TABLE measurement_y2006m03
PARTITION OF measurement
FOR VALUES END ('2006-04-01'); --success

CREATE TABLE measurement_fail
PARTITION OF measurement
FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
ERROR: cannot create partition that overlaps with an existing one

For each partition, START and END bound values are stored in the
catalog. Note that the lower bound is inclusive, whereas the upper bound
is exclusive.

Note: At most one range partition can have null min bound in which case it
covers the range (-INFINITY, END). Also, at most one range partition can
have null max bound in which case it covers the range [START, +INFINITY).

A normal dependency is created between the parent and the partition.

3. Multi-level partitioning

CREATE TABLE table_name
PARTITION OF partitioned_table_name
FOR VALUES values_spec
PARTITION BY {RANGE|LIST} ON (columns_list)

This variant implements a form of so called composite or sub-partitioning
with arbitrarily deep partitioning structure. A table created using this
form has both the relkind RELKIND_PARTITIONED_REL and
pg_class.relispartition set to true.

4. (yet unimplemented) Attach partition (from existing table)

ALTER TABLE partitioned_table
ATTACH PARTITION partition_name
FOR VALUES values_spec
USING [TABLE] table_name;

ALTER TABLE table_name
SET VALID PARTITION OF <parent>;

The first of the above pair of commands would attach table_name as a (yet)
'invalid' partition of partitioned_table (after confirming that it matches
the schema and does not overlap with other partitions per FOR VALUES
spec). It would also record the FOR VALUES part in the partition catalog
and set pg_class.relispartition to true for table_name.

After the first command is done, the second command would take exclusive
lock on table_name, scan the table to check if it contains any values
outside the boundaries defined by FOR VALUES clause defined previously,
throw error if so, mark as valid partition of parent if not.

Does that make sense?

5. Detach partition

ALTER TABLE partitioned_table
DETACH PARTITION partition_name [USING table_name]

This removes partition_name as partition of partitioned_table. The table
continues to exist with the same name or 'table_name', if specified.
pg_class.relispartition is set to false for the table, so it behaves like
a normal table.

System catalogs
===============

1. pg_partitioned_rel

CATALOG(pg_partitioned_rel,oid) BKI_WITHOUT_OIDS
{
Oid partrelid; /* partitioned table pg_class.oid */
char partstrategy; /* partitioning strategy 'l'/'r' */
int16 partnatts; /* number of partition columns */
int2vector partkey; /* column numbers of partition columns;
* 0 where specified column is an
* expresion */
oidvector partclass; /* operator class to compare keys */
pg_node_tree partexprs; /* expression trees for partition key
* members that are not simple column
* references; one for each zero entry
* in partkey[] */
};

2. pg_partition (omits partisvalid alluded to above)

CATALOG(pg_partition,oid) BKI_WITHOUT_OIDS
{
Oid partitionid; /* partition oid */
Oid partparent; /* parent oid */
anyarray partlistvalues; /* list of allowed values of the only
* partition column */
anyarray partrangebounds; /* list of bounds of ranges of
* allowed values per partition key
* column */
};

Further notes
=============

There are a number of restrictions on performing after-the-fact changes
using ALTER TABLE to partitions (ie, relispartition=true):

* Cannot add/drop column
* Cannot set/drop OIDs
* Cannot set/drop NOT NULL
* Cannot set/drop default
* Cannot alter column type
* Cannot add/drop alter constraint (table level)
* Cannot change persistence
* Cannot change inheritance
* Cannot link to a composite type

Such changes should be made to the topmost parent in the partitioning
hierarchy (hereafter referred to as just parent). These are recursively
applied to all the tables in the hierarchy. Although the last two items
cannot be performed on parent either.

Dropping a partition using DROP TABLE is not allowed. It needs to detached
using ALTER TABLE on parent before it can be dropped as a normal table.

Triggers on partitions are not allowed. They should be defined on the
parent. That said, I could not figure out a way to implement row-level
AFTER triggers on partitioned tables (more about that in a moment); so
they are currently not allowed:

CREATE TRIGGER audit_trig
AFTER INSERT ON persons
FOR EACH ROW EXECUTE PROCEDURE audit_func();
ERROR: Row-level AFTER triggers are not supported on partitioned tables

Column/table constraints on partitions are not allowed. They should be
defined on the parent. Foreign key constraints are not allowed due to
above limitation (no row-level after triggers).

A partitioning parent (RELKIND_PARTITIONED_REL) does not have storage.

Creating index on parent is not allowed. They should be defined on (leaf)
partitions. Because of this limitation, primary keys are not allowed on a
partitioned table. Perhaps, we should be able to just create a dummy
entry somewhere to represent an index on parent (which every partition
then copies.) Then by restricting primary key to contain all partition key
columns, we can implement unique constraint over the whole partitioned
table. That will in turn allow us to use partitioned tables as PK rels in
a foreign key constraint provided row-level AFTER trigger issue is resolved.

VACUUM/ANALYZE on individual partitions should work like normal tables.
I've not implemented something like inheritance tree sampling for
partitioning tree in this patch. Autovacuum has been taught to ignore
parent tables and vacuum/analyze partitions normally.

Dropping a partitioned table should (?) unconditionally drop all its
partitions probably but, currently the patch uses dependencies, so
requires to specify CASCADE to do the same.

What should TRUNCATE on partitioned table do?

Ownership, privileges/permissions, RLS should be managed through the
parent table although not comprehensively addressed in the patch.

There is no need to define tuple routing triggers. CopyFrom() and
ExecInsert() determine target partition just before performing
heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
constraints (on parent) are executed for tuple before being routed to a
partition. If no partition can be found, it's an error.

Because row-level AFTER triggers need to save ItemPointers in trigger
event data and defining triggers on partitions (which is where tuples
really go) is not allowed, I could not find a straightforward way to
implement them. So, perhaps we should allow (only) row-level AFTER
triggers on partitions or think of modifying trigger.c to know about this
twist explicitly.

Internal representations
========================

For a RELKIND_PARTITIONED_REL relations, RelationData gets a few new
fields to store the partitioning metadata. That includes partition key
tuple (pg_partitioned_rel) including some derived info (opfamily,
opcintype, compare proc FmgrInfos, partition expression trees).

Additionally, it also includes a PartitionInfo object which includes
partition OIDs array, partition bound arrays (if range partitioned,
rangemax is sorted in ascending order and OIDs are likewise ordered). It
is built from information in pg_partition catalog.

While RelationBuildDesc() initializes the basic key info, fields like
expression trees, PartitionInfo are built on demand and cached. For
example, InitResultRelInfo() builds the latter to populate the newly added
ri_PartitionKeyInfo and ri_Partitions fields, respectively.

PartitionInfo object is rebuilt on every cache invalidation of the rel
which includes when adding/attaching/detaching a new partition.

Planner and executor considerations
=====================================

The patch does not yet implement any planner changes for partitioned
tables, although I'm working on the same and post updates as soon as
possible. That means, it is not possible to run SELECT/UPDATE/DELETE
queries on partitioned tables without getting:

postgres=# SELECT * FROM persons;
ERROR: could not open file "base/13244/106975": No such file or directory

Given that there would be more direct ways of performing partition pruning
decisions with the proposed, it would be nice to utilize them.
Specifically, I would like to avoid having to rely on constraint exclusion
for partition pruning whereby subquery_planner() builds append_rel_list
and the later steps exclude useless partitions.

By extending RelOptInfo to include partitioning info for partitioned rels,
it might be possible to perform partition pruning directly without
previously having to expand them. Although, as things stand now, it's not
clear how that might work - when would partition RTEs be added to the
rtable? The rtable is assumed not to change after
setup_simple_rel_arrays() has done its job which is much earlier than when
it would be desirable for the partitioned table expansion (along with
partition pruning) to happen. Moreover, if that means we might not be able
to build RelOptInfo's for partitions, how to choose best paths for them
(index paths or not, etc.)?

I'm also hoping we don't require something like inheritance_planner() for
when partitioned tables are target rels. I assume considerations for why
the special processing is necessary for inheritance trees in that scenario
don't apply to partitioning trees. So, if grouping_planner() returns a
Append plan (among other options) for the partitioning tree, tacking a
ModifyTable node on top should do the trick?

Suggestions greatly welcome in this area.

Other items
===========

Will include the following once we start reaching consensus on main parts
of the proposed design/implementation:

* New regression tests
* Documentation updates
* pg_dump, psql, etc.

For reference, some immediately previous discussions:

* On partitioning *
http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org

* Partitioning WIP patch *
http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp

Comments welcome!

Thanks,
Amit

Attachment	Content-Type	Size
declarative-partitioning-wip-1.patch	text/x-diff	219.5 KB

Responses

Re: Declarative partitioning at 2015-08-18 11:02:01 from Thom Brown
Re: Declarative partitioning at 2015-08-18 13:43:23 from David Fetter
Re: Declarative partitioning at 2015-08-18 17:31:38 from Josh Berkus
Re: Declarative partitioning at 2015-08-18 17:57:44 from Marc Mamin
Re: Declarative partitioning at 2015-08-18 17:59:41 from Corey Huinker
Re: Declarative partitioning at 2015-08-19 12:23:43 from Simon Riggs
Re: Declarative partitioning at 2015-08-20 09:27:50 from Pavan Deolasee
Re: Declarative partitioning at 2015-08-26 14:00:15 from Simon Riggs
Re: Declarative partitioning at 2015-08-26 14:01:58 from Simon Riggs
Re: Declarative partitioning at 2016-01-22 13:54:12 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Thom Brown	2015-08-18 11:02:01	Re: Declarative partitioning
Previous Message	PostgreSQL - Hans-Jürgen Schönig	2015-08-18 10:07:10	Re: Proposal: Implement failover on libpq connect level.