Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-10-28 07:53:54
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016/10/05 10:50, Robert Haas wrote:
> On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> [ latest patch set ]
> Reviewing 0003:

Thanks a lot for the review and sorry about the delay in replying.

> + This form attaches an existing table (partitioned or otherwise) as
> (which might itself be partitioned)
> + partition of the target table. Partition bound specification must
> + correspond with the partition method and the key of the target table.
> The partition bound specification must correspond to the partitioning
> method and partitioning key of the target table.
> + The table being attached must have all the columns of the target table
> + with matching types and no more. Also, it must have all the matching
> The table to be attached must have all of the same columns as the
> target table and no more; moreover, the column types must also match.


> + with matching types and no more. Also, it must have all the matching
> + constraints as the target table. That includes both <literal>NOT NULL</>
> + and <literal>CHECK</> constraints. If some <literal>CHECK</literal>
> + constraint of the table being attached is marked <literal>NO
> INHERIT</literal>,
> + the command will fail; such constraints must either be dropped or
> + recreated without the <literal>NO INHERIT</literal> mark.
> Why all of these requirements?

Except the last one, regular inheritance currently has the same
requirements; from the docs:

To be added as a child, the target table must already contain all the same
columns as the parent (it could have additional columns, too). The columns
must have matching data types, and if they have NOT NULL constraints in
the parent then they must also have NOT NULL constraints in the child.

There must also be matching child-table constraints for all CHECK
constraints of the parent, except those marked non-inheritable (that is,
created with ALTER TABLE ... ADD CONSTRAINT ... NO INHERIT) in the parent,
which are ignored; all child-table constraints matched must not be marked

Note that the "(it could have additional columns, too)" part does not
apply for partitions. Also the last point does not hold for partitioned
tables because they never have NO INHERIT constraints.

> We could instead perform a scan to
> validate that the constraints are met. I think the way this should
> work is:
> 1. ATTACH PARTITION works whether matching NOT NULL and CHECK
> constraints are present or not.

Wouldn't this break the invariant that if the parent has a NOT NULL or a
CHECK constraint, all of its children must have it too? Remember that we
allow directly inserting/updating data into partitions, so these
constraints better be persistent.

> 2. If all of the constraints are present, and a validated constraint
> matching the implicit partitioning constraint is also present, then
> ATTACH PARTITION does not scan the table to validate constraints;
> otherwise, it does.

I wonder what matching in "validated constraint matching with the implicit
partitioning constraint" means? By that do you mean an exact equal()
match, provided both have been suitably canonicalized? Or an existing
check constraint that *implies* the partitioning constraint?

> 3. NO VALIDATE is not an option.

I had my doubts about this option initially, but 2 above seemed hard to
do. Or perhaps we can just remove this option and we will always scan
when attaching a partition.

> + Currently <literal>UNIQUE</literal>, <literal>PRIMARY KEY</literal>, and
> + <literal>FOREIGN KEY</literal> constraints are not considered, but that
> + might change in the future.
> Really? Changing that sounds impractical to me.

Assuming you are questioning the "but that might change in the future"
part, I took that line from the description for INHERIT. Maybe it was
written in that section with a view that someone might implement globally
enforced variant of each of those constraints. I thought the path to that
would be slightly easier with declarative partitioning but that may be too
optimistic. Removed that part.

> + This form detaches specified partition of the target table. The
> + detached partition continues to exist as a standalone table with no ties
> + remaining with the target table.
> continues to exist as a standalone table, but no longer has any ties
> to the table from which it was detached.


> + Note that if a partition being detached is itself a partitioned table,
> + it continues to exist as such.
> You don't really need to say this, I think. All of the properties of
> the detached table are retained, not only its partitioning status.
> You wouldn't like it if I told you to document "note that if a
> partition being detached is unlogged, it will still be unlogged".

I thought same as Petr who replied to this part of your email - might help
avoid confusion. But it might be redundant, so I removed that line.

> To add the table as a new child of a parent table, you must own the
> - parent table as well.
> + parent table as well. That applies to both adding the table as a
> + inheritance child of a parent table and attaching a table as partition to
> + the table.
> To add the table as a new child of a parent table, or as a new
> partition of a partitioned table, you must own the parent table as
> well.


> + The name of the table to attach as a new partition to or
> detach from this table.
> s/to or/or to/

Oops, fixed.

> + <literal>NO VALIDATE</> option is spcified.
> Typo, but see comments above about nuking this altogether.

Fixed the typo for now. As I said above it's not clear to me how the
alternative method of skipping the scan is supposed to work.

> A recursive <literal>DROP COLUMN</literal> operation will remove a
> descendant table's column only if the descendant does not inherit
> that column from any other parents and never had an independent
> - definition of the column. A nonrecursive <literal>DROP
> + definition of the column (which always holds if the descendant table
> + is a partition). A nonrecursive <literal>DROP
> COLUMN</literal> (i.e., <command>ALTER TABLE ONLY ... DROP
> COLUMN</command>) never removes any descendant columns, but
> - instead marks them as independently defined rather than inherited.
> + instead marks them as independently defined rather than inherited,
> + unless the descendant table is a partition.
> This is a hairy explanation. I suggest that the documentation say
> this (and that the code implement it): A nonrecursive DROP TABLE
> command will fail for a partitioned table, because all partitions of a
> table must have the same columns as the partitioning root.

Agreed, done (including the code).

> - that are not marked <literal>NO INHERIT</>.
> + that are not marked <literal>NO INHERIT</> which are unsupported if
> + the table is a partitioned table.
> I think you can omit this hunk.


> + If <literal>PARTITION OF</literal> clause is specified then the table is
> + created as a partition of <literal>parent_table</literal> with specified
> + bounds. However, unlike regular tables, one cannot specify
> + <literal>PARTITION BY</literal> clause which means foreign tables can
> + only be created as leaf partitions.
> I'd delete the sentence beginning with "However".


> + Create foreign table <structname>measurement_y2016m07</>, which will be
> + accessed through the server <structname>server_07</>, that is partition
> + of the range partitioned table <structname>measurement</>:
> s/, that is/ as a/


> +<phrase>and <replaceable
> class="PARAMETER">partition_bound_spec</replaceable> is:</phrase>
> +
> +FOR VALUES { <replaceable class="PARAMETER">list_spec</replaceable> |
> <replaceable class="PARAMETER">range_spec</replaceable> }
> I think you can inline the definitions of list_spec and range_spec
> here instead of making them separate productions, and I think that
> would be preferable.
> FOR VALUES { IN ( <replaceable
> class="PARAMETER">expression</replaceable> [, ...] ) |
> START <replaceable class="PARAMETER">lower-bound</replaceable> [
> INCLUSIVE | EXCLUSIVE ] END <replaceable
> class="PARAMETER">upper-bound</replaceable> [ INCLUSIVE | EXCLUSIVE ]
> }

Ah, done that way.

> + parent table (name optionally schema-qualified).
> Parenthetical phrase isn't needed.


> + A partition bound specification must be present and must correspond with
> + partition method and key of the parent table. It is checked using the
> + specification that the new partition does not overlap with any existing
> + partitions of the parent.
> The partition bound specification must correspond to the partitioning
> method and partitioning key of the parent table, and must not overlap
> with any existing partition of that parent.


> + clause, if any. Defaults and constraints can optionally be specified
> + for each of the inherited columns, which override those in the parent.
> Surely not. You can't "override" an inherited constraint or an
> inherited default. The child may have extra constraints not present
> in the parent, and may have different defaults when it is directly
> targeted by an insertion, but it can't possibly override the parent
> defaults.

Removed ", which override those in the parent.".

> + One can also specify table constraints, in addition to those inherited
> + from the parent. Note that all subsequent schema modifications to the
> + parent propagate to partition.
> The first part of this seems right, but then what's going on with the
> reference to constraints in the previous sentence? (See previous
> review comment.) The second sentence I would delete (but see below).

In the previous sentence, I meant the constraints that one can specify
using WITH OPTIONS [ column_constraint [...] ], but I removed
the "overrides those in the parent" part as mentioned.

Removed the second sentence and instead adopted your text below.

> + <para>
> + Any data row subsequently inserted into the parent table is mapped to
> + and stored in the partition, provided partition key of the row falls
> + within the partition bounds.
> + </para>
> How about: Rows inserted into a partitioned table will be
> automatically routed to the correct partition. If no suitable
> partition exists, an error will occur.

Much better, done.

> + <para>
> + A partition is dropped or truncated when the parent table is dropped or
> + truncated. Dropping it directly using <literal>DROP TABLE</literal>
> + will fail; it must first be <firstterm>detached</> from the parent.
> + However, truncating a partition directly works.
> + </para>
> How about: A partition must have the same column names and types as
> the table of which it is a partition. Therefore, modifications the
> column names or types of the partitioned table will automatically
> propagate to all children, as will operations such as TRUNCATE which
> normally affect a table and all of its inheritance children. It is
> also possible to TRUNCATE a partition individually, just as for an
> inheritance child.


> Insisting that you can't drop a child without detaching it first seems
> wrong to me. If I already made this comment and you responded to it,
> please point me back to whatever you said. However, my feeling is
> this is flat wrong and absolutely must be changed.

I said the following [1]:

| Hmm, I don't think I like this. Why should it be necessary to detach
| a partition before dropping it? That seems like an unnecessary step.

I thought we had better lock the parent table when removing one of its
partitions and it seemed a bit odd to lock the parent table when dropping
a partition using DROP TABLE? OTOH, with ALTER TABLE parent DETACH
PARTITION, the parent table is locked anyway.

> - if (is_no_inherit)
> +
> + /* Discard the NO INHERIT flag if the relation is a partition */
> + if (is_no_inherit && !rel->rd_rel->relispartition)
> Something about this seems fishy. In StoreRelCheck(), you disallow
> the addition of a NO INHERIT constraint on a partition, but here you
> seem happy to accept it and ignore the NO INHERIT property. That
> doesn't seem consistent. We should adopt a consistent policy about
> what to do about such constraints, and I submit that throwing an error
> is better than silently changing things, unless you have some reason
> for doing that which I'm not seeing. Anyway, we should have the same
> behavior in both cases.

Allowing to add NO INHERIT constraints to (leaf) partitions does not seem
like it will cause any problems not that the flag will have any meaning.
It will come into play only in one case - in the form of causing an error
if a constraint with the same name is added to the parent.

With that in mind, I removed the check in StoreRelCheck() that you mention
here and also removed the above hunk to which your comment was addressed.

> We should also standardize on what value of conislocal and coninhcount
> children end up with; presumably the latter should be 1, but I'm not
> sure if the former should be true or false. In either case, anything
> that can vary between children probably needs to be dumped, so let's
> enforce that it doesn't so we don't have to dump it. I'm not sure
> whether the code here achieves those objectives, though, and note the
> comment in the function header about making sure the logic here
> matches MergeConstraintsIntoExisting.

To summarize:

If a table is partition, it doesn't have any local attributes. Also, it
doesn't have any check constraint that is both local *and* inherited.
Only the non-inherited constraints are ever local.

So, all parent attributes will be present present in all the partitions at
any given time with attislocal = false and attinhcount = 1. Likewise for
inherited constraints with conislocal = false and coninhcount = 1.

With above rules in place, pg_dump no longer dumps attributes or inherited
constraints in CREATE TABLE commands of individual partitions.

> I think the overriding principle here should be: If you attach a table
> as a partition, it must not be part of a standard inheritance
> hierarchy, and it must not be a partition of any other table. It can,
> however, be partitioned itself. If you later detach a partition, it
> ends up as a standalone table with a copy of each constraint it had as
> a partition - probably including the implicit partition constraint.
> The DBA can drop those constraints if they're not wanted.

Check. About implicit constraints, see below.

> I wonder if it's really a good idea for the partition constraints to
> be implicit; what is the benefit of leaving those uncatalogued?

I did start out that way - ie, catalogued implicit constraints, but later
thought it might not be good to end up with multiple copies of essentially
the same information. With cataloguing them will come dependencies and
all places that know about pg_constraint.

In the long term, I think we're only going to need them because we want to
enforce them when directly inserting data into partitions.

So, detaching a partition does not emit a check constraint matching the
implicit partition constraint.

> + * Depending on whether the relation in question is list or range
> + * partitioned, one of the fields is set.
> + */
> +typedef struct BoundCollectionData
> +{
> + struct ListInfo *listinfo;
> + struct RangeInfo *rangeinfo;
> +} BoundCollectionData;
> This seems like an odd design. First, when you have a pointer to
> either of two things, the normal tool for that in C would be a union,
> not a struct. Second, in PostgreSQL we typically handle that by making
> both of the things nodes and then you can use IsA() or switch on
> nodeTag() to figure out what you've got. Third, the only place this
> is used at least in 0003 is as part of PartitionDescData, which only
> has 3 members, so if you were going to do it with 2 members, you could
> just include these two members directly. Considering all of the
> foregoing, I'd suggest making this a union and including partstrategy
> in PartitionDescData.
> I think that the names ListInfo and RangeInfo are far too generic for
> something that's specific to partitioning.

How about this:

* Collection of bounds of a partitioned relation (either physical or
* logical)
typedef struct BoundCollectionData
char strategy; /* list or range bounds? */

struct PartitionListInfo lists;
struct PartitionRangeInfo ranges;
} bounds;
} BoundCollectionData;

I added the strategy field here instead of PartitionDescData, because this
structure is supposed to represent even the transient partitioned
relations. If the relation is a physical relation we can determine
strategy from PartitionKey.

I renamed ListInfo/RangeInfo to PartitionListInfo/PartitionRangeInfo.

> +/*
> + * Range bound collection - sorted array of ranges of partitions of a range
> + * partitioned table
> + */
> +typedef struct RangeInfo
> +{
> + struct PartitionRange **ranges;
> +} RangeInfo;
> +
> +/* One partition's range */
> +typedef struct PartitionRange
> +{
> + struct PartitionRangeBound *lower;
> + struct PartitionRangeBound *upper;
> +} PartitionRange;
> This representation doesn't seem optimal, because in most cases the
> lower bound of one partition will be the upper bound of the next. I
> suggest that you flatten this representation into a single list of
> relevant bounds, each flagged as to whether it is exclusive and
> whether it is finite; and a list with one more element of bounds. For
> example, suppose the partition bounds are [10, 20), [20, 30), (30,
> 40), and [50, 60). You first construct a list of all of the distinct
> bounds, flipping inclusive/exclusive for the lefthand bound in each
> case. So you get:
> When ordering items for this list, if the same item appears twice, the
> EXCLUSIVE copy always appears before INCLUSIVE. When comparing
> against an EXCLUSIVE item, we move to the first half of the array if
> we are searching for a value strictly less than that value; when
> comparing against an INCLUSIVE item, we move to the first half of the
> array if we are searching for a value less than or equal to that
> value.
> This is a list of seven items, so a binary search will return a
> position between 0 (less than all items) and 7 (greater than all
> items). So we need a list of 8 partition mappings, which in this case
> will look like this: -1, 0, 1, -1, 2, -1, 3, -1.
> In this particular example, there are only two adjacent partitions, so
> we end up with 7 bounds with this representation vs. 8 with yours, but
> in general I think the gains will be bigger. If you've got 1000
> partitions and they're all adjacent, which is likely, you store 1000
> bounds instead of 2000 bounds by doing it this way.

Thanks for the idea. I have implemented the same.

> + * Note: This function should be called only when it is known that 'relid'
> + * is a partition.
> Why? How about "Because this function assumes that the relation whose
> OID is passed as an argument will have precisely one parent, it should
> only been called when it is known that the relation is a partition."

Rewrote the note.

> + /*
> + * Translate vars in the generated expression to have correct attnos.
> + * Note that the vars in my_qual bear attnos dictated by key which carries
> + * physical attnos of the parent. We must allow for a case where physical
> + * attnos of a partition can be different from the parent.
> + */
> + partexprs_item = list_head(key->partexprs);
> + for (i = 0; i < key->partnatts; i++)
> + {
> + AttrNumber attno = key->partattrs[i],
> + new_attno;
> + char *attname;
> +
> + if (attno != 0)
> + {
> + /* Simple column reference */
> + attname = get_attname(RelationGetRelid(parent), attno);
> + new_attno = get_attnum(RelationGetRelid(rel), attname);
> +
> + if (new_attno != attno)
> + my_qual = (List *) translate_var_attno((Node *) my_qual,
> + attno,
> + new_attno);
> It can't really be safe to do this one attribute number at a time, or
> even if by some good fortune it can't be broken, it at least it seems
> extremely fragile. Suppose that you translate 0 -> 3 and then 3 -> 5;
> now the result is garbage. It's not very efficient to do this one
> attno at a time, either.

You are quite right. I changed this to us map_variable_attnos().

When doing that, I noticed that the above function does not map whole-row
references but instead just returns whether one was encountered, so that
the caller can output an error that expressions containing whole-row vars
are not allowed in the corresponding context (examples of such callers
include transformTableLikeClause()). In context of this function, there
is no way we could output such error.

Thinking more about that, even if we did map whole-row vars in check
expressions, constraint exclusion code would not be able to handle them,
because it does not work with anything but Const node as predicate's or
query clause's right-operand.

I went ahead and prohibited whole-row references from being used in the
partition key expressions in the first place (ie, in patch 0001).

> + if (classform->relispartition)
> + ereport(ERROR,
> + errmsg("\"%s\" is a partition of \"%s\"", rel->relname,
> + get_rel_name(get_partition_parent(relOid))),
> + errhint("Use ALTER TABLE DETACH PARTITION to be able
> to drop it.")));
> +
> RangeVarCallbackForDropRelation should do only minimal sanity checks;
> defer this until after we have a relation lock.

Moved this check to RemoveRelations(), wherein after
RangeVarGetRelidExtended() returns, the relation is guaranteed to be locked.

> I didn't get all the way through this patch, so this is a pretty
> incomplete review, but it's late here and I'm out of steam for
> tonight. Some general comments:
> 1. I think that this patch seems to introduce an awful lot of new
> structures with confusingly similar names and purposes:
> PartitionBoundList, PartitionBoundRange, ListInfo, RangeInfo,
> PartitionRange, PartitionList, ListValue, RangePartition. You've got
> 4 different structures here that all have "Partition" and "Range" in
> the name someplace, including both PartitionRange and RangePartition.
> Maybe there's no way around that kind of duplication; after all there
> are quite a few moving parts here. But it seems like it would be good
> to try to simplify it.

OK, I got rid of most of those structs and now there are only the
following: PartitionListInfo, PartitionRangeInfo, PartitionRangeBound and

> 2. I'm also a bit concerned about the fairly large amount of
> apparently-boilerplate code in partition.c, all having to do with how
> we create all of these data structures and translate between different
> forms of them. I haven't understood that stuff thoroughly enough to
> have a specific idea about how we might be able to get rid of any of
> it, and maybe there's no way. But that too seems like a topic for
> futher investigation. One idea is that maybe some of these things
> should be nodes that piggyback on the existing infrastructure in
> src/backend/nodes instead of inventing a new way to do something
> similar.

I thought about the idea of using the src/backend/nodes infrastructure for
partitioning data structures. However, none of those structures are part
of any existing Node nor do I imagine they will be in future, so making
them a Node seems unnecessary. But then there is a subset of NodeTags
called "TAGS FOR RANDOM OTHER STUFF" which are Node objects that not part
of parse/plan/execute node tree structures such as T_TriggerData,
T_ForeignKeyCacheInfo. I wonder if you meant to make partitioning
structures nodes of that category.

That said, I have tried in the current patch to reduce the amount of
unnecessary boilerplate code.

> 3. There are a lot of places where you have separate code for the
> range and list partitioning cases, and I'm suspicious that there are
> ways that code could be unified. For example, with list partitioning,
> you have a bunch of Datums which represent the specific values that
> can appear in the various partitions, and with range partitioning, you
> have a bunch of Datums that represent the edges of partitions. Well,
> if you used the same array for both purposes, you could use the same
> code to copy it. That would involve flattening away a lot of the
> subsidiary structure under PartitionDescData and pulling up stuff that
> is buried lower down into the main structure, but I think that's
> likely a good idea anyway - see also point #1.

Hmm, I think it might be somewhat clearer to keep them separate in the
form of PartitionListInfo and PartitionRangeInfo structs. In case of
range partitioning, each individual range bound is actually a
PartitionRangeBound instance, not just a Datum. That's because we have to
store for each range bound the following information: an array of Datums
to represent the composite partition key, is-finite flags for each
partitioning column, a flag representing whether it is a inclusive bound,
and a flag representing whether it is lower or upper bound.

> 4. I'm somewhat wondering if we ought to just legislate that the lower
> bound is always inclusive and the upper bound is always exclusive.
> The decision to support both inclusive and exclusive partition bounds
> is responsible for an enormous truckload of complexity in this patch,
> and I have a feeling it is also going to be a not-infrequent cause of
> user error.

I thought we decided at some point to go with range type like notation to
specify range partition bound because of its flexibility. I agree though
that with that flexibility, there will more input combinations that will
cause error. As for the internal complexity, it's not clear to me whether
it will be reduced by always-inclusive lower and always-exclusive upper
bounds. We would still need to store the inclusive flag with individual
PartitionRangeBound and consider it when comparing them with each other
and with partition key of tuples.

> 5. I wonder how well this handles multi-column partition keys. You've
> just got one Datum flag and one is-finite flag per partition, but I
> wonder if you don't need to track is-finite on a per-column basis, so
> that you could partition on (a, b) and have the first partition go up
> to (10, 10), the second to (10, infinity), the third to (20, 10), the
> fourth to (20, infinity), and the last to (infinity, infinity). FWIW,
> Oracle supports this sort of thing, so perhaps we should, too. On a

I have implemented the feature that allows specifying UNBOUNDED per column.

> related note, I'm not sure it's going to work to treat a composite
> partition key as a record type. The user may want to specify a
> separate opfamily and collation for each column, not just inherit
> whatever the record behavior is. I'm not sure if that's what you are
> doing, but the relcache structures don't seem adapted to storing one
> Datum per partitioning column per partition, but rather just one Datum
> per partition, and I'm not sure that's going to work very well.

Actually, there *is* one Datum per partitioning column. That is,
composite partition key is not treated as a record as it may have seemed.

Please find attached the latest version of the patches taking care of the
above review comments and some other issues I found. As Amit Kapila
requested [2], here is a list of restrictions on partitioned tables, of
which some we might be able to overcome in future:

- cannot create indexes
- cannot create row triggers
- cannot specify UNIQUE, PRIMARY KEY, FOREIGN KEY, EXCLUDE constraints
- cannot become inheritance parent or child
- cannot use in COPY TO command




Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-10.patch text/x-diff 113.8 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-10.patch text/x-diff 26.2 KB
0003-Catalog-and-DDL-for-partitions-10.patch text/x-diff 201.0 KB
0004-psql-and-pg_dump-support-for-partitions-10.patch text/x-diff 21.3 KB
0005-Refactor-optimizer-s-inheritance-set-expansion-code-10.patch text/x-diff 16.3 KB
0006-Teach-a-few-places-to-use-partition-check-quals-10.patch text/x-diff 30.2 KB
0007-Introduce-a-PartitionTreeNode-data-structure-10.patch text/x-diff 8.0 KB
0008-Tuple-routing-for-partitioned-tables-10.patch text/x-diff 43.8 KB
0009-Update-DDL-Partitioning-chapter-to-reflect-new-devel-10.patch text/x-diff 24.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Gilles Darold 2016-10-28 08:03:37 Re: Patch to implement pg_current_logfile() function
Previous Message Andres Freund 2016-10-28 07:40:29 Re: Proposal : For Auto-Prewarm.