Re: Declarative partitioning

From: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning
Date: 2015-08-18 17:59:41
Message-ID: CADkLM=ezDHuQmonRhMyRm12MoZtgQq9SYBkUxqZRLbkOFjuWiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 18, 2015 at 6:30 AM, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp
> wrote:

>
> Hi,
>
> I would like propose $SUBJECT for this development cycle. Attached is a
> WIP patch that implements most if not all of what's described below. Some
> yet unaddressed parts are mentioned below, too. I'll add this to the
> CF-SEP.
>
> Syntax
> ======
>
> 1. Creating a partitioned table
>
> CREATE TABLE table_name
> PARTITION BY {RANGE|LIST}
> ON (column_list);
>
> Where column_list consists of simple column names or expressions:
>
> PARTITION BY LIST ON (name)
> PARTITION BY RANGE ON (year, month)
>
> PARTITION BY LIST ON ((lower(left(name, 2)))
> PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d)))
>
> Note: LIST partition key supports only one column.
>
> For each column, you could write operator class name:
>
> PARTITION BY LIST/RANGE ON (colname [USING] opclass_name),
>
> If not specified, the default btree operator class based on type of each
> key column is used. If none of the available btree operator classes are
> compatible with the partitioning strategy (list/range), error is thrown.
> Built-in btree operator classes cover a good number of types for list and
> range partitioning in practical scenarios.
>
> A table created using this form is of proposed new relkind
> RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is
> created to store partition key info.
>
> Note: A table cannot be partitioned after-the-fact using ALTER TABLE.
>
> Normal dependencies are created between the partitioned table and operator
> classes, object in partition expressions like functions.
>
> 2. Creating a partition of a partitioned table
>
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec;
>
> Where values_spec is:
>
> listvalues: [IN] (val1, ...)
>
> rangevalues: START (col1min, ... ) END (col1max, ... )
> | START (col1min, ... )
> | END (col1max, ... )
>
> A table created using this form has proposed pg_class.relispartition set
> to true. An entry in pg_partition (see below) is created to store the
> partition bound info.
>
> The values_spec should match the partitioning strategy of the partitioned
> table. In case of a range partition, the values in START and/or END should
> match columns in the partition key.
>
> Defining a list partition is fairly straightforward - just spell out the
> list of comma-separated values. Error is thrown if the list of values
> overlaps with one of the existing partitions' list.
>
> CREATE TABLE persons_by_state (name text, state text)
> PARTITION BY LIST ON (state);
>
> CREATE TABLE persons_IL
> PARTITION OF persons_by_state
> FOR VALUES IN ('IL');
>
> CREATE TABLE persons_fail
> PARTITION OF persons_by_state
> FOR VALUES IN ('IL');
> ERROR: cannot create partition that overlaps with an existing one
>
> For a range partition, there are more than one way:
>
> Specify both START and END bounds: resulting range should not overlap with
> the range(s) covered by existing partitions. Error is thrown otherwise.
> Although rare in practice, gaps between ranges are OK.
>
> CREATE TABLE measurement(logdate date NOT NULL)
> PARTITION BY RANGE ON (logdate);
>
> CREATE TABLE measurement_y2006m02
> PARTITION OF measurement
> FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success
>
> CREATE TABLE measurement_fail
> PARTITION OF measurement
> FOR VALUES START ('2006-02-15') END ('2006-03-01');
> ERROR: cannot create partition that overlaps with an existing one
>
> Specify only the START bound: add the partition on the left of some range
> covered by existing partitions provided no overlap occurs (also
> considering gaps between ranges, if any). If no such range exists, the new
> partition will cover the range [START, +INFINITY) and become the rightmost
> partition. Error is thrown if the specified START causes overlap.
>
> CREATE TABLE measurement_y2006m01
> PARTITION OF measurement
> FOR VALUES START ('2006-01-01'); --success
>
> CREATE TABLE measurement_fail
> PARTITION OF measurement
> FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02
> ERROR: cannot create partition that overlaps with an existing one
>
> Specify only the END bound: add the partition on the right of some range
> covered by existing partitions provided no overlap occurs (also
> considering gaps between ranges, if any). If no such range exists, the new
> partition would cover the range (-INFINITY, END) and become the leftmost
> partition. Error is thrown if the specified END causes overlap.
>
> CREATE TABLE measurement_y2006m03
> PARTITION OF measurement
> FOR VALUES END ('2006-04-01'); --success
>
> CREATE TABLE measurement_fail
> PARTITION OF measurement
> FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02
> ERROR: cannot create partition that overlaps with an existing one
>
> For each partition, START and END bound values are stored in the
> catalog. Note that the lower bound is inclusive, whereas the upper bound
> is exclusive.
>
> Note: At most one range partition can have null min bound in which case it
> covers the range (-INFINITY, END). Also, at most one range partition can
> have null max bound in which case it covers the range [START, +INFINITY).
>
> A normal dependency is created between the parent and the partition.
>
> 3. Multi-level partitioning
>
> CREATE TABLE table_name
> PARTITION OF partitioned_table_name
> FOR VALUES values_spec
> PARTITION BY {RANGE|LIST} ON (columns_list)
>
> This variant implements a form of so called composite or sub-partitioning
> with arbitrarily deep partitioning structure. A table created using this
> form has both the relkind RELKIND_PARTITIONED_REL and
> pg_class.relispartition set to true.
>
> 4. (yet unimplemented) Attach partition (from existing table)
>
> ALTER TABLE partitioned_table
> ATTACH PARTITION partition_name
> FOR VALUES values_spec
> USING [TABLE] table_name;
>
> ALTER TABLE table_name
> SET VALID PARTITION OF <parent>;
>
> The first of the above pair of commands would attach table_name as a (yet)
> 'invalid' partition of partitioned_table (after confirming that it matches
> the schema and does not overlap with other partitions per FOR VALUES
> spec). It would also record the FOR VALUES part in the partition catalog
> and set pg_class.relispartition to true for table_name.
>
> After the first command is done, the second command would take exclusive
> lock on table_name, scan the table to check if it contains any values
> outside the boundaries defined by FOR VALUES clause defined previously,
> throw error if so, mark as valid partition of parent if not.
>
> Does that make sense?
>
> 5. Detach partition
>
> ALTER TABLE partitioned_table
> DETACH PARTITION partition_name [USING table_name]
>
> This removes partition_name as partition of partitioned_table. The table
> continues to exist with the same name or 'table_name', if specified.
> pg_class.relispartition is set to false for the table, so it behaves like
> a normal table.
>
>
> System catalogs
> ===============
>
> 1. pg_partitioned_rel
>
> CATALOG(pg_partitioned_rel,oid) BKI_WITHOUT_OIDS
> {
> Oid partrelid; /* partitioned table pg_class.oid */
> char partstrategy; /* partitioning strategy 'l'/'r' */
> int16 partnatts; /* number of partition columns */
> int2vector partkey; /* column numbers of partition columns;
> * 0 where specified column is an
> * expresion */
> oidvector partclass; /* operator class to compare keys */
> pg_node_tree partexprs; /* expression trees for partition key
> * members that are not simple column
> * references; one for each zero entry
> * in partkey[] */
> };
>
>
> 2. pg_partition (omits partisvalid alluded to above)
>
> CATALOG(pg_partition,oid) BKI_WITHOUT_OIDS
> {
> Oid partitionid; /* partition oid */
> Oid partparent; /* parent oid */
> anyarray partlistvalues; /* list of allowed values of the only
> * partition column */
> anyarray partrangebounds; /* list of bounds of ranges of
> * allowed values per partition key
> * column */
> };
>
>
> Further notes
> =============
>
> There are a number of restrictions on performing after-the-fact changes
> using ALTER TABLE to partitions (ie, relispartition=true):
>
> * Cannot add/drop column
> * Cannot set/drop OIDs
> * Cannot set/drop NOT NULL
> * Cannot set/drop default
> * Cannot alter column type
> * Cannot add/drop alter constraint (table level)
> * Cannot change persistence
> * Cannot change inheritance
> * Cannot link to a composite type
>
> Such changes should be made to the topmost parent in the partitioning
> hierarchy (hereafter referred to as just parent). These are recursively
> applied to all the tables in the hierarchy. Although the last two items
> cannot be performed on parent either.
>
> Dropping a partition using DROP TABLE is not allowed. It needs to detached
> using ALTER TABLE on parent before it can be dropped as a normal table.
>
> Triggers on partitions are not allowed. They should be defined on the
> parent. That said, I could not figure out a way to implement row-level
> AFTER triggers on partitioned tables (more about that in a moment); so
> they are currently not allowed:
>
> CREATE TRIGGER audit_trig
> AFTER INSERT ON persons
> FOR EACH ROW EXECUTE PROCEDURE audit_func();
> ERROR: Row-level AFTER triggers are not supported on partitioned tables
>
> Column/table constraints on partitions are not allowed. They should be
> defined on the parent. Foreign key constraints are not allowed due to
> above limitation (no row-level after triggers).
>
> A partitioning parent (RELKIND_PARTITIONED_REL) does not have storage.
>
> Creating index on parent is not allowed. They should be defined on (leaf)
> partitions. Because of this limitation, primary keys are not allowed on a
> partitioned table. Perhaps, we should be able to just create a dummy
> entry somewhere to represent an index on parent (which every partition
> then copies.) Then by restricting primary key to contain all partition key
> columns, we can implement unique constraint over the whole partitioned
> table. That will in turn allow us to use partitioned tables as PK rels in
> a foreign key constraint provided row-level AFTER trigger issue is
> resolved.
>
> VACUUM/ANALYZE on individual partitions should work like normal tables.
> I've not implemented something like inheritance tree sampling for
> partitioning tree in this patch. Autovacuum has been taught to ignore
> parent tables and vacuum/analyze partitions normally.
>
> Dropping a partitioned table should (?) unconditionally drop all its
> partitions probably but, currently the patch uses dependencies, so
> requires to specify CASCADE to do the same.
>
> What should TRUNCATE on partitioned table do?
>
> Ownership, privileges/permissions, RLS should be managed through the
> parent table although not comprehensively addressed in the patch.
>
> There is no need to define tuple routing triggers. CopyFrom() and
> ExecInsert() determine target partition just before performing
> heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
> constraints (on parent) are executed for tuple before being routed to a
> partition. If no partition can be found, it's an error.
>
> Because row-level AFTER triggers need to save ItemPointers in trigger
> event data and defining triggers on partitions (which is where tuples
> really go) is not allowed, I could not find a straightforward way to
> implement them. So, perhaps we should allow (only) row-level AFTER
> triggers on partitions or think of modifying trigger.c to know about this
> twist explicitly.
>
> Internal representations
> ========================
>
> For a RELKIND_PARTITIONED_REL relations, RelationData gets a few new
> fields to store the partitioning metadata. That includes partition key
> tuple (pg_partitioned_rel) including some derived info (opfamily,
> opcintype, compare proc FmgrInfos, partition expression trees).
>
> Additionally, it also includes a PartitionInfo object which includes
> partition OIDs array, partition bound arrays (if range partitioned,
> rangemax is sorted in ascending order and OIDs are likewise ordered). It
> is built from information in pg_partition catalog.
>
> While RelationBuildDesc() initializes the basic key info, fields like
> expression trees, PartitionInfo are built on demand and cached. For
> example, InitResultRelInfo() builds the latter to populate the newly added
> ri_PartitionKeyInfo and ri_Partitions fields, respectively.
>
> PartitionInfo object is rebuilt on every cache invalidation of the rel
> which includes when adding/attaching/detaching a new partition.
>
> Planner and executor considerations
> =====================================
>
> The patch does not yet implement any planner changes for partitioned
> tables, although I'm working on the same and post updates as soon as
> possible. That means, it is not possible to run SELECT/UPDATE/DELETE
> queries on partitioned tables without getting:
>
> postgres=# SELECT * FROM persons;
> ERROR: could not open file "base/13244/106975": No such file or directory
>
> Given that there would be more direct ways of performing partition pruning
> decisions with the proposed, it would be nice to utilize them.
> Specifically, I would like to avoid having to rely on constraint exclusion
> for partition pruning whereby subquery_planner() builds append_rel_list
> and the later steps exclude useless partitions.
>
> By extending RelOptInfo to include partitioning info for partitioned rels,
> it might be possible to perform partition pruning directly without
> previously having to expand them. Although, as things stand now, it's not
> clear how that might work - when would partition RTEs be added to the
> rtable? The rtable is assumed not to change after
> setup_simple_rel_arrays() has done its job which is much earlier than when
> it would be desirable for the partitioned table expansion (along with
> partition pruning) to happen. Moreover, if that means we might not be able
> to build RelOptInfo's for partitions, how to choose best paths for them
> (index paths or not, etc.)?
>
> I'm also hoping we don't require something like inheritance_planner() for
> when partitioned tables are target rels. I assume considerations for why
> the special processing is necessary for inheritance trees in that scenario
> don't apply to partitioning trees. So, if grouping_planner() returns a
> Append plan (among other options) for the partitioning tree, tacking a
> ModifyTable node on top should do the trick?
>
> Suggestions greatly welcome in this area.
>
> Other items
> ===========
>
> Will include the following once we start reaching consensus on main parts
> of the proposed design/implementation:
>
> * New regression tests
> * Documentation updates
> * pg_dump, psql, etc.
>
> For reference, some immediately previous discussions:
>
> * On partitioning *
>
> http://www.postgresql.org/message-id/20140829155607.GF7705@eldon.alvh.no-ip.org
>
> * Partitioning WIP patch *
> http://www.postgresql.org/message-id/54EC32B6.9070605@lab.ntt.co.jp
>
> Comments welcome!
>
> Thanks,
> Amit
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

Quick thoughts borne of years of slugging it out with partitions on Oracle:

- Finally!!!!!!!!!!!

- Your range partitioning will need to express exclusive/inclusive bounds,
or go to the Oracle model where every partition is a cascading "values less
than" test context dependent on the partitions defined before it. I would
suggest that leveraging existing range types (or allowing the user to
specify a range type, like for a specific collation of a text range) would
allow for the most flexible and postgres-ish range definition. You seem to
do this with the "[USING] opclass_name" bit, but don't follow through on
the START(...) and END(...). Something like FOR VALUES <@
'[''2014-01-01'',''2015-01-01)'::daterange would cover most needs
succinctly, though I admit the syntax for complex ranges could be
cumbersome, though something like FOR VALUES <@
'[(''a'',1),(''b'',1))'::letter_then_number_range is still readable.

- No partitioning scheme survives first contact with reality. So you will
need a facility for splitting and joining existing partitions. For
splitting partitions, it's sufficient to require that the new partition
share either a upper/lower bound (with the same inclusivity/exclusivity) of
an existing partition, thus uniquely identifying the partition to be split,
and require that the other bound be within the range of the partition to be
split. Similarly, it's fair to require that the partitions to be joined be
adjacent in range. In both cases, range operators make these tests simple.

- Your features 4 and 5 are implemented in Oracle with SWAP PARTITION,
which is really neat for doing ETLs and index rebuilds offline in a copy
table, and then swapping the data segment of that table with the partition
specified. Which could be considered cheating because none of the partition
metadata changed, just the pointers to the segments. We already do this
with adding removing INHERIT. I'm not saying they can't be separate
functionality, but keeping an atomic SWAP operation would be grand.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2015-08-18 18:07:30 Re: WIP: SCRAM authentication
Previous Message Marc Mamin 2015-08-18 17:57:44 Re: Declarative partitioning