Re: Declarative partitioning

From: Thom Brown <thom(at)linux(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning
Date: 2015-08-18 11:02:01
Message-ID: CAA-aLv6RmcdnRM0zc04FgksC6cpmXUYbDi-8X3qNCGr50iGswQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 18 August 2015 at 11:30, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
wrote:

>
> Hi,
>
> I would like propose $SUBJECT for this development cycle. Attached is a
> WIP patch that implements most if not all of what's described below. Some
> yet unaddressed parts are mentioned below, too. I'll add this to the
> CF-SEP.
>
> Syntax
> ======
>
> 1. Creating a partitioned table
>
> CREATE TABLE table_name
> PARTITION BY {RANGE|LIST}
> ON (column_list);
>
> Where column_list consists of simple column names or expressions:
>
> PARTITION BY LIST ON (name)
> PARTITION BY RANGE ON (year, month)
>
> PARTITION BY LIST ON ((lower(left(name, 2)))
> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
>
> Note: LIST partition key supports only one column.
>
> For each column, you could write operator class name:
>
> PARTITION BY LIST/RANGE ON (colname [USING] opclass_name),
>
> If not specified, the default btree operator class based on type of each
> key column is used. If none of the available btree operator classes are
> compatible with the partitioning strategy (list/range), error is thrown.
> Built-in btree operator classes cover a good number of types for list and
> range partitioning in practical scenarios.
>
> A table created using this form is of proposed new relkind
> RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is
> created to store partition key info.
>
> Note: A table cannot be partitioned after-the-fact using ALTER TABLE.
>
> Normal dependencies are created between the partitioned table and operator
> classes, object in partition expressions like functions.
>
> 2. Creating a partition of a partitioned table
>
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
>
> Where values_spec is:
>
> listvalues: [IN] (val1, ...)
>
> rangevalues: START (col1min, ... ) END (col1max, ... )
> | START (col1min, ... )
> | END (col1max, ... )
>
> A table created using this form has proposed pg_class.relispartition set
> to true. An entry in pg_partition (see below) is created to store the
> partition bound info.
>
> The values_spec should match the partitioning strategy of the partitioned
> table. In case of a range partition, the values in START and/or END should
> match columns in the partition key.
>
> Defining a list partition is fairly straightforward - just spell out the
> list of comma-separated values. Error is thrown if the list of values
> overlaps with one of the existing partitions' list.
>
> CREATE TABLE persons_by_state (name text, state text)
> PARTITION BY LIST ON (state);
>
> CREATE TABLE persons_IL
> PARTITION OF persons_by_state
> FOR VALUES IN ('IL');
>
> CREATE TABLE persons_fail
> PARTITION OF persons_by_state
> FOR VALUES IN ('IL');
> ERROR: cannot create partition that overlaps with an existing one
>
> For a range partition, there are more than one way:
>
> Specify both START and END bounds: resulting range should not overlap with
> the range(s) covered by existing partitions. Error is thrown otherwise.
> Although rare in practice, gaps between ranges are OK.
>
> CREATE TABLE measurement(logdate date NOT NULL)
> PARTITION BY RANGE ON (logdate);
>
> CREATE TABLE measurement_y2006m02
> PARTITION OF measurement
> FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success
>
> CREATE TABLE measurement_fail
> PARTITION OF measurement
> FOR VALUES START ('2006-02-15') END ('2006-03-01');
> ERROR: cannot create partition that overlaps with an existing one
>
> Specify only the START bound: add the partition on the left of some range
> covered by existing partitions provided no overlap occurs (also
> considering gaps between ranges, if any). If no such range exists, the new
> partition will cover the range [START, +INFINITY) and become the rightmost
> partition. Error is thrown if the specified START causes overlap.
>
> CREATE TABLE measurement_y2006m01
> PARTITION OF measurement
> FOR VALUES START ('2006-01-01'); --success
>
> CREATE TABLE measurement_fail
> PARTITION OF measurement
> FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
> ERROR: cannot create partition that overlaps with an existing one
>
> Specify only the END bound: add the partition on the right of some range
> covered by existing partitions provided no overlap occurs (also
> considering gaps between ranges, if any). If no such range exists, the new
> partition would cover the range (-INFINITY, END) and become the leftmost
> partition. Error is thrown if the specified END causes overlap.
>
> CREATE TABLE measurement_y2006m03
> PARTITION OF measurement
> FOR VALUES END ('2006-04-01'); --success
>
> CREATE TABLE measurement_fail
> PARTITION OF measurement
> FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
> ERROR: cannot create partition that overlaps with an existing one
>
> For each partition, START and END bound values are stored in the
> catalog. Note that the lower bound is inclusive, whereas the upper bound
> is exclusive.
>
> Note: At most one range partition can have null min bound in which case it
> covers the range (-INFINITY, END). Also, at most one range partition can
> have null max bound in which case it covers the range [START, +INFINITY).
>
> A normal dependency is created between the parent and the partition.
>
> 3. Multi-level partitioning
>
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec
> PARTITION BY {RANGE|LIST} ON (columns_list)
>
> This variant implements a form of so called composite or sub-partitioning
> with arbitrarily deep partitioning structure. A table created using this
> form has both the relkind RELKIND_PARTITIONED_REL and
> pg_class.relispartition set to true.
>
> 4. (yet unimplemented) Attach partition (from existing table)
>
> ALTER TABLE partitioned_table
> ATTACH PARTITION partition_name
> FOR VALUES values_spec
> USING [TABLE] table_name;
>
> ALTER TABLE table_name
> SET VALID PARTITION OF <parent>;
>
> The first of the above pair of commands would attach table_name as a (yet)
> 'invalid' partition of partitioned_table (after confirming that it matches
> the schema and does not overlap with other partitions per FOR VALUES
> spec). It would also record the FOR VALUES part in the partition catalog
> and set pg_class.relispartition to true for table_name.
>
> After the first command is done, the second command would take exclusive
> lock on table_name, scan the table to check if it contains any values
> outside the boundaries defined by FOR VALUES clause defined previously,
> throw error if so, mark as valid partition of parent if not.
>
> Does that make sense?
>
> 5. Detach partition
>
> ALTER TABLE partitioned_table
> DETACH PARTITION partition_name [USING table_name]
>
> This removes partition_name as partition of partitioned_table. The table
> continues to exist with the same name or 'table_name', if specified.
> pg_class.relispartition is set to false for the table, so it behaves like
> a normal table.
>
>
> System catalogs
> ===============
>
> 1. pg_partitioned_rel
>
> CATALOG(pg_partitioned_rel,oid) BKI_WITHOUT_OIDS
> {
> Oid partrelid; /* partitioned table pg_class.oid */
> char partstrategy; /* partitioning strategy 'l'/'r' */
> int16 partnatts; /* number of partition columns */
> int2vector partkey; /* column numbers of partition columns;
> * 0 where specified column is an
> * expresion */
> oidvector partclass; /* operator class to compare keys */
> pg_node_tree partexprs; /* expression trees for partition key
> * members that are not simple column
> * references; one for each zero entry
> * in partkey[] */
> };
>
>
> 2. pg_partition (omits partisvalid alluded to above)
>
> CATALOG(pg_partition,oid) BKI_WITHOUT_OIDS
> {
> Oid partitionid; /* partition oid */
> Oid partparent; /* parent oid */
> anyarray partlistvalues; /* list of allowed values of the only
> * partition column */
> anyarray partrangebounds; /* list of bounds of ranges of
> * allowed values per partition key
> * column */
> };
>
>
> Further notes
> =============
>
> There are a number of restrictions on performing after-the-fact changes
> using ALTER TABLE to partitions (ie, relispartition=true):
>
> * Cannot add/drop column
> * Cannot set/drop OIDs
> * Cannot set/drop NOT NULL
> * Cannot set/drop default
> * Cannot alter column type
> * Cannot add/drop alter constraint (table level)
> * Cannot change persistence
> * Cannot change inheritance
> * Cannot link to a composite type
>
> Such changes should be made to the topmost parent in the partitioning
> hierarchy (hereafter referred to as just parent). These are recursively
> applied to all the tables in the hierarchy. Although the last two items
> cannot be performed on parent either.
>
> Dropping a partition using DROP TABLE is not allowed. It needs to detached
> using ALTER TABLE on parent before it can be dropped as a normal table.
>
> Triggers on partitions are not allowed. They should be defined on the
> parent. That said, I could not figure out a way to implement row-level
> AFTER triggers on partitioned tables (more about that in a moment); so
> they are currently not allowed:
>
> CREATE TRIGGER audit_trig
> AFTER INSERT ON persons
> FOR EACH ROW EXECUTE PROCEDURE audit_func();
> ERROR: Row-level AFTER triggers are not supported on partitioned tables
>
> Column/table constraints on partitions are not allowed. They should be
> defined on the parent. Foreign key constraints are not allowed due to
> above limitation (no row-level after triggers).
>
> A partitioning parent (RELKIND_PARTITIONED_REL) does not have storage.
>
> Creating index on parent is not allowed. They should be defined on (leaf)
> partitions. Because of this limitation, primary keys are not allowed on a
> partitioned table. Perhaps, we should be able to just create a dummy
> entry somewhere to represent an index on parent (which every partition
> then copies.) Then by restricting primary key to contain all partition key
> columns, we can implement unique constraint over the whole partitioned
> table. That will in turn allow us to use partitioned tables as PK rels in
> a foreign key constraint provided row-level AFTER trigger issue is
> resolved.
>
> VACUUM/ANALYZE on individual partitions should work like normal tables.
> I've not implemented something like inheritance tree sampling for
> partitioning tree in this patch. Autovacuum has been taught to ignore
> parent tables and vacuum/analyze partitions normally.
>
> Dropping a partitioned table should (?) unconditionally drop all its
> partitions probably but, currently the patch uses dependencies, so
> requires to specify CASCADE to do the same.
>
> What should TRUNCATE on partitioned table do?
>
> Ownership, privileges/permissions, RLS should be managed through the
> parent table although not comprehensively addressed in the patch.
>
> There is no need to define tuple routing triggers. CopyFrom() and
> ExecInsert() determine target partition just before performing
> heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
> constraints (on parent) are executed for tuple before being routed to a
> partition. If no partition can be found, it's an error.
>
> Because row-level AFTER triggers need to save ItemPointers in trigger
> event data and defining triggers on partitions (which is where tuples
> really go) is not allowed, I could not find a straightforward way to
> implement them. So, perhaps we should allow (only) row-level AFTER
> triggers on partitions or think of modifying trigger.c to know about this
> twist explicitly.
>
> Internal representations
> ========================
>
> For a RELKIND_PARTITIONED_REL relations, RelationData gets a few new
> fields to store the partitioning metadata. That includes partition key
> tuple (pg_partitioned_rel) including some derived info (opfamily,
> opcintype, compare proc FmgrInfos, partition expression trees).
>
> Additionally, it also includes a PartitionInfo object which includes
> partition OIDs array, partition bound arrays (if range partitioned,
> rangemax is sorted in ascending order and OIDs are likewise ordered). It
> is built from information in pg_partition catalog.
>
> While RelationBuildDesc() initializes the basic key info, fields like
> expression trees, PartitionInfo are built on demand and cached. For
> example, InitResultRelInfo() builds the latter to populate the newly added
> ri_PartitionKeyInfo and ri_Partitions fields, respectively.
>
> PartitionInfo object is rebuilt on every cache invalidation of the rel
> which includes when adding/attaching/detaching a new partition.
>
> Planner and executor considerations
> =====================================
>
> The patch does not yet implement any planner changes for partitioned
> tables, although I'm working on the same and post updates as soon as
> possible. That means, it is not possible to run SELECT/UPDATE/DELETE
> queries on partitioned tables without getting:
>
> postgres=# SELECT * FROM persons;
> ERROR: could not open file "base/13244/106975": No such file or directory
>
> Given that there would be more direct ways of performing partition pruning
> decisions with the proposed, it would be nice to utilize them.
> Specifically, I would like to avoid having to rely on constraint exclusion
> for partition pruning whereby subquery_planner() builds append_rel_list
> and the later steps exclude useless partitions.
>
> By extending RelOptInfo to include partitioning info for partitioned rels,
> it might be possible to perform partition pruning directly without
> previously having to expand them. Although, as things stand now, it's not
> clear how that might work - when would partition RTEs be added to the
> rtable? The rtable is assumed not to change after
> setup_simple_rel_arrays() has done its job which is much earlier than when
> it would be desirable for the partitioned table expansion (along with
> partition pruning) to happen. Moreover, if that means we might not be able
> to build RelOptInfo's for partitions, how to choose best paths for them
> (index paths or not, etc.)?
>
> I'm also hoping we don't require something like inheritance_planner() for
> when partitioned tables are target rels. I assume considerations for why
> the special processing is necessary for inheritance trees in that scenario
> don't apply to partitioning trees. So, if grouping_planner() returns a
> Append plan (among other options) for the partitioning tree, tacking a
> ModifyTable node on top should do the trick?
>
> Suggestions greatly welcome in this area.
>
> Other items
> ===========
>
> Will include the following once we start reaching consensus on main parts
> of the proposed design/implementation:
>
> * New regression tests
> * Documentation updates
> * pg_dump, psql, etc.
>
> For reference, some immediately previous discussions:
>
> * On partitioning *
>
> http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org
>
> * Partitioning WIP patch *
> http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp
>
> Comments welcome!
>
> Thanks,
> Amit
>

Wow, didn't expect to see that email this morning.

A very quick test:

CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item
text) partition by range on ((extract(year from
purchase_time)),(extract(month from purchase_time)));
ERROR: referenced relation "purchases" is not a table or foreign table

Thom

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2015-08-18 11:23:45 Re: Declarative partitioning
Previous Message Amit Langote 2015-08-18 10:30:20 Declarative partitioning