Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-08-26 08:15:45
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry it took me a while to reply. Attached updated patches including the
review comments on 0001 at [1].

On 2016/08/17 3:54, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> 0002-psql-and-pg_dump-support-for-partitioned-tables.patch
> + if (pset.sversion >= 90600 && tableinfo.relkind == 'P')
> Version check is redundant, right?

Yep, fixed.

> +) PARTITION BY RANGE ((a+b));
> +\d describe_range_key
> +Partitioned table "public.describe_range_key"
> + Column | Type | Modifiers
> +--------+---------+-----------
> + a | integer |
> + b | integer |
> +Partition Key: PARTITION BY RANGE (((a + b)))
> I understand that it's probably difficult not to end up with two sets
> of parentheses here, but can we avoid ending up with three sets?

Fixed. This code was copy-pasted from the index code which has other
considerations for adding surrounding parentheses which don't apply to the
partition key code.

> Also, I wonder if pg_get_partkeydef() should omit "PARTITION BY" and
> pg_dump can add that part back. Then this could say:
> Partition Key: RANGE ((a + b))
> ...which seems a good deal more natural than what you have now.

Agreed, so done that way.

>> 0003-Catalog-and-DDL-for-partition-bounds.patch
>> Partition DDL includes both a way to create new partition and "attach" an
>> existing table as a partition of parent partitioned table. Attempt to
>> drop a partition using DROP TABLE causes an error. Instead a partition
>> needs first to be "detached" from parent partitioned table. On the other
>> hand, dropping the parent drops all the partitions if CASCADE is specified.
> Hmm, I don't think I like this. Why should it be necessary to detach
> a partition before dropping it? That seems like an unnecessary step.

I thought we had better lock the parent table when removing one of its
partitions and it seemed a bit odd to lock the parent table when dropping
a partition using DROP TABLE? OTOH, with ALTER TABLE parent DETACH
PARTITION, the parent table is locked anyway.

> +
> </synopsis>
> Unnecessary hunk.
> + <para>
> + If this table is a partition, one cannot perform <literal>DROP
> + on a column if it is marked not null in the parent table.
> + not null.
> + </para>
> Sentence fragment.


> + <para>
> + Note that unlike the <literal>ATTACH PARTITION</> command, a partition
> + being detached can be itself partitioned. In that case, it continues
> + to exist as such.
> + </para>
> This is another restriction I don't understand. Why can't I attach a
> partitioned table?

I removed this restriction.

ATExecAttachPartition() adds a AT work queue entry for the table being
attached to perform a heap scan in the rewrite phase. The scan is done
for checking that no row violates the partition boundary condition. When
attaching a partitioned table as partition, multiple AT work queue entries
are now added - one for each leaf partition of the table being attached.

> + indicate that descendant tables are included. Note that whether
> + <literal>ONLY</> or <literal>*</> is specified has no effect in case
> + of a partitioned table; descendant tables (in this case, partitions)
> + are always included.
> Ugh, why? I think this should work exactly the same way for
> partitioned tables that it does for any other inheritance hierarchy.
> Sure, you'll get no rows, but who cares?

Agreed, done that way.

> +CREATE FOREIGN TABLE measurement_y2016m07
> + PARTITION OF measurement FOR VALUES START ('2016-07-01') END
> ('2016-08-01');
> + SERVER server_07;
> Extra semicolon?


> + A partition cannot have columns other than those inherited from the
> + parent. That includes the <structfield>oid</> column, which can be
> I think experience suggests that this is a good restriction, but then
> why does the syntax synopsis indicate that PARTITION BY can be
> specified along with column definitions? Similarly for CREATE FOREIGN

The syntax synopsis of CREATE TABLE ... PARTITION OF indicates that a list
of column WITH OPTION and/or table_constraint can be specified. It does
not allow column definitions.

In this case, inherited columns will be listed in the PARTITION BY clause.

Do you mean that the CREATE TABLE ... PARTITION OF syntax should allow
column definitions just like INHERITS does and unlike regular inheritance,
throw error if columns other than those to be merged are found?

> + When specifying for a table being created as partition, one needs to
> + use column names from the parent table as part of the key.
> This is not very clear.

Sentence removed because it may be clear from the context that inherited
columns are to be used for the partition key.

> - /* Remove NO INHERIT flag if rel is a partitioned table */
> - if (relid_is_partitioned(relid))
> + /* Discard NO INHERIT, if relation is a partitioned table or a
> partition */
> + if (relid_is_partitioned(relid) || relid_is_partition(relid))
> is_no_inherit = false;
> It might be right to disallow NO INHERIT in this case, but I don't
> think it can be right to just silently ignore it.

OK, I changed this to instead throw an error if a NO INHERIT check
constraint is added to a partitioned table or a partition.

> + * Not flushed from the cache by RelationClearRelation() unless changed because
> + * of addition or removal of partitions.
> This seems unlikely to be safe, unless I'm missing something.

Like TupleDesc, a table's PartitionDesc is preserved across relcache
rebuilds. PartitionDesc consists of arrays of OIDs and partition bounds
for a table's immediate partitions. It can only change by adding/removing
partitions to/from the table which requires an exclusive lock on it.
Since this data can grow arbitrarily big it seemed better to not have to
copy it around, so a direct pointer to the relcache field (rd_partdesc) is
given to callers. relcache rebuilds that do not logically change
PartitionDesc leave it intact so that some user of it is not left with a
dangling pointer. Am I missing something?

> + form = (Form_pg_inherits) GETSTRUCT(tuple);
> +
> + systable_endscan(scan);
> + heap_close(catalogRelation, AccessShareLock);
> +
> + return form->inhparent;
> This is unsafe. After systable_endscan, it is no longer OK to access
> form->inhparent.
> Try building with CLOBBER_CACHE_ALWAYS to find other cache flush hazards.
> There should probably be a note in the function header comment that it
> is unsafe to use this for an inheritance child that is not a
> partition, because there could be more than one parent in that case.
> Or maybe the whole idea of this function just isn't very sound...

Fixed unsafe coding and added a comment to the function saying it should
be called on tables known to be partitions.

> +static List *
> +get_partitions(Oid relid, int lockmode)
> +{
> + return find_inheritance_children(relid, lockmode);
> +}
> What's the point? If we're going to have a wrapper here at all, then
> shouldn't it have a name that matches the existing convention - e.g.
> find_partitions() or find_child_partitions()? But I think you might
> as well just use find_inheritance_children() directly.

OK, changed to just use find_inheritance_children() directly.

> + * Happens when we have created the pg_inherits entry
> but not the
> + * pg_partition entry yet.
> Why do we ever allow the flow of control to reach this point while we
> are in such an intermediate state?

Fixed to prevent this from happening.

> +free_partition_info(PartitionInfoData **p, int num)
> Seems very error-prone. Isn't this why MemoryContextReset was invented?

Got rid of this function.

> +relid_is_partition(Oid relid)
> +{
> + return SearchSysCacheExists1(PARTRELID, ObjectIdGetDatum(relid));
> +}
> This is used in a lot of places, and the overhead of checking it in
> all of those places is not necessarily nil. Syscache lookups aren't
> free. What if we didn't create a new catalog for this and instead
> just added a relpartitionbound attribute to pg_class? It seems a bit
> silly to have a whole extra catalog to store one extra column...

OK, I got rid of the pg_partition catalog and added a pg_class attribute
for storing partition bound. Because the newly added attribute is a
pg_node_tree and hence not readily accessible without a heap_getattr()
call, I added a boolean relispartition as well. Now all the
relid_is_partition() calls have been replaced by checks using

> /*
> + * If this foreign table is a partition, check that the FDW supports
> + * insert.
> + */
> + if (stmt->base.partbound != NULL)
> + {
> + FdwRoutine *fdw_routine;
> +
> + fdw_routine = GetFdwRoutine(fdw->fdwhandler);
> + if (fdw_routine->ExecForeignInsert == NULL)
> + ereport(ERROR,
> + errmsg("cannot create foreign
> table as partition"),
> + errdetail("foreign-data
> wrapper \"%s\" does not support insert",
> + fdw->fdwname)));
> + }
> Why? This seems like an entirely arbitrary prohibition. If inserts
> aren't supported, then they'll fail at runtime. Same for updates or
> deletes, for which you have not added checks. I think you should just
> remove this.

OK, removed this check.

Instead, ExecInitModifyTable()/BeginCopy() call CheckValidResultRel() for
every leaf partition to check that they are ready for CMD_INSERT.

> + /* Force inheritance recursion, if partitioned table. */
> + if (recurse || relid_is_partitioned(myrelid))
> Disagree with this, too. There's no reason for partitioned tables to
> be special in this way. Similarly, disagree with all of the places
> that do something similar.

Removed the forced recursion bit here and a few other places.

> - errmsg("column \"%s\" in child table
> must be marked NOT NULL",
> - attributeName)));
> + errmsg("column \"%s\" in %s table must
> be marked NOT NULL",
> + attributeName,
> + is_attach_partition ?
> "source" : "child")));
> You have a few of these; they cause problems for translators, because
> different languages have different word ordering. Repeat the entire
> message instead: is_attach_partition ? "column \"%s\" in source table
> must be marked NOT NULL" : "column \"%s\" in child table must be
> marked NOT NULL".

Changed so that the entire message is repeated.

> +-- XXX add ownership tests
> So do that. :-)

Done, sorry about that.

> +ERROR: column "b" is not null in parent
> +HINT: Please drop not null in the parent instead
> Hmm. That hint doesn't seem like project style, and I'm not sure
> that it really makes sense to issue such a hint anyway. Who knows
> whether that is the right thing to do? I think you should somehow be
> complaining about the fact that this is a partition, rather than
> complaining about the fact that the column is NOT NULL in the parent.
> Are we insisting that the flags match exactly, or only that the child
> may not allow nulls unless the parent does?

I think the latter. It is assumed that all the parent's constraints are
present in a child table, because in ExecInsert()/CopyFrom() we perform
ExecConstraints() using the child relation even if the actual insert was
on the parent. Also, if an inherited NOT NULL constraint on a child's
column is dropped irrespective of parent's, selecting a parent's NOT NULL
column might return nulls from the child table that no longer has the

I recently came across a related proposal whereby dropping *inherited* NOT
NULL from child tables will be prevented. Problems in letting it be be
dropped are mentioned here:

That proposal is probably being reworked such that NOT NULL constraints
get a pg_constraint entry with proper accounting of inheritance count.

> +ERROR: new partition's list of values overlaps with partition
> "lpart1" of "list_parted"
> Maybe just:
> ERROR: partitions must not overlap
> -or-
> ERROR: partition "%s" would overlap partition "%s"

OK, I changed to the second message.

> As before, this is just an initial read-through, so apologies for
> whatever I may have missed.

Thanks a lot for the review.

By the way, I am still working on the following items and will be included
in the next version of the patch.

* Fix internal representation of list partition bounds to be more efficient
* Add PartitionOptInfo

As mentioned in [2], I have refactored the inheritance expansion code
within optimizer so that a partitioned table's inheritance hierarchy is
preserved in resulting AppendRelInfos (patch 0005). One immediate benefit
of that is that if constraint exclusion determines that an intermediate
partition is to be excluded then all partitions underneath it are excluded
automatically. With the current flattened model, all partitions in that
subtree would have been processed and excluded one-by-one.



Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-2.patch text/x-diff 115.9 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-2.patch text/x-diff 22.5 KB
0003-Catalog-and-DDL-for-partitions-2.patch text/x-diff 196.6 KB
0004-psql-and-pg_dump-support-for-partitions-2.patch text/x-diff 17.8 KB
0005-Refactor-optimizer-s-inheritance-set-expansion-code-2.patch text/x-diff 14.5 KB
0006-Teach-a-few-places-to-use-partition-check-quals-2.patch text/x-diff 17.2 KB
0007-Introduce-a-PartitionTreeNode-data-structure-2.patch text/x-diff 8.0 KB
0008-Tuple-routing-for-partitioned-tables-2.patch text/x-diff 40.3 KB
0009-Update-DDL-Partitioning-chapter-to-reflect-new-devel-2.patch text/x-diff 24.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Eduardo Morras 2016-08-26 08:45:06 Re: increasing the default WAL segment size
Previous Message Amit Langote 2016-08-26 08:03:34 Re: Declarative partitioning - another take