Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-10-04 08:02:54
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for the comments.

On 2016/09/30 9:10, Robert Haas wrote:
> On Thu, Sep 29, 2016 at 8:09 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Things were that way initially, that is, the parent relations had no
>> relfilenode. I abandoned that project however. The storage-less parent
>> thing seemed pretty daunting to me to handle right away. For example,
>> optimizer and the executor code needed to be taught about the parent rel
>> appendrel member that used to be included as part of the result of
>> scanning the inheritance set but was no longer.
> Even if we leave the empty relfilenode around for now -- in the long
> run I think it should die -- I think we should prohibit the creation
> of subsidiary object on the parent which is only sensible if it has
> rows - e.g. indexes. It makes no sense to disallow non-inheritable
> constraints while allowing indexes, and it could box us into a corner
> later.

I agree. So we must prevent from the get-go the creation of following
objects on parent tables (aka RELKIND_PARTITIONED_TABLE relations):

* Indexes
* Row triggers (?)

In addition to preventing creation of these objects, we must also teach
commands that directly invoke heapam.c to skip such relations.

I have not implemented that in the patch yet though.

>>> + /*
>>> + * Run the expressions through eval_const_expressions. This is
>>> + * not just an optimization, but is necessary, because eventually
>>> + * the planner will be comparing them to similarly-processed qual
>>> + * clauses, and may fail to detect valid matches without this.
>>> + * We don't bother with canonicalize_qual, however.
>>> + */
>>> I'm a bit confused by this, because I would think this processing
>>> ought to have been done before storing anything in the system
>>> catalogs. I don't see why it should be necessary to do it again after
>>> pulling data back out of the system catalogs.
>> The pattern matches what's done for other expressions that optimizer deals
>> with, such as CHECK, index key, and index predicate expressions.
> That's kind of a non-answer answer, but OK. Still, I think you
> shouldn't just copy somebody else's comment blindly into a new place.
> Reference the original comment, or write your own.

Sorry I could have explained a bit more in my previous message. I rewrote
the comment as follows:

* Run the expressions through const-simplification since the planner
* will be comparing them to similarly-processed qual clause operands,
* and may fail to detect valid matches without this step. We don't
* need to bother with canonicalize_qual() though, because partition
* expressions are not full-fledged qualification clauses.

>>> + if (classform->relkind != relkind &&
>>> + (relkind == RELKIND_RELATION &&
>>> + classform->relkind != RELKIND_PARTITIONED_TABLE))
>>> That's broken. Note that all of the conditions are joined using &&,
>>> so if any one of them fails then we won't throw an error. In
>>> particular, it's no longer possible to throw an error when relkind is
>> You are right. I guess it would have to be the following:
>> + if ((classform->relkind != relkind &&
>> + classform->relkind != RELKIND_PARTITIONED_TABLE) ||
>> + (classform->relkind == RELKIND_PARTITIONED_TABLE &&
>> + relkind != RELKIND_RELATION))
>> Such hackishness could not be helped because we have a separate DROP
>> command for every distinct relkind, except we overload DROP TABLE for both
>> regular and partitioned tables.
> Maybe this would be better:
> syntax_relkind = RELKIND_RELATION;
> else
> syntax_relkind = rekind;
> if (classform->relkind != syntax_relkind)
> DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

In this case, relkind refers to the command viz. DROP <ObjectType/relkind>
<relname>. It can never be RELKIND_PARTITIONED_TABLE, so the above will
be equivalent to leaving things unchanged. Anyway I agree the suggested
style is better, but I assume you meant:

if (classform->relkind == RELKIND_PARTITIONED_TABLE)
expected_relkind = RELKIND_RELATION;
expected_relkind = classform->relkind;

if (relkind != expected_relkind)
DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

>>> Why isn't this logic being invoked from transformCreateStmt()? Then
>>> we could use the actual parseState for the query instead of a fake
>>> one.
>> Because we need an open relation for it to work, which in this case there
>> won't be until after we have performed heap_create_with_catalog() in
>> DefineRelation(). Mainly because we need to perform transformExpr() on
>> expressions. That's similar to how cookConstraint() on the new CHECK
>> constraints cannot be performed earlier. Am I missing something?
> Hmm, yeah, I guess that's the same thing. I guess I got on this line
> of thinking because the function name started with "transform" which
> is usually something happening during parse analysis only. Maybe add
> a code comment explaining why the work can't be done sooner, just like
> what you've written here.

Added the comment.

>> Hmm, I guess it wouldn't hurt to just leave any COLLATE clauses as it is -
>> just remove the above code. Of course, we must then record the collation
>> in the catalog alongside other user-specified information such as operator
>> class. Currently, if the key is a simple column we use its attcollation
>> and if it's an expression then we use its exprCollation().
>> When I first wrote it, I wasn't sure what the implications of explicit
>> collations would be for partitioning. There are two places where it comes
>> into play: a) when comparing partition key values using the btree compare
>> function b) embedded as varcollid and inputcollid in implicitly generated
>> check constraints for partitions. In case of the latter, any mismatch
>> with query-specified collation causes constraint exclusion proof to be
>> canceled. When it's default collations everywhere, the chances of that
>> sort of thing happening are less. If we support user-specified collations
>> on keys, then things will get a little bit more involved.
> I think it's worth rewinding to general principles here: the only time
> a collation has real meaning is in the context of a comparison
> operation. If we ask whether A < B, the answer may depend on the
> collation that is used to perform the comparison. Any other place
> that mentions a collation is only trying to influence the collation
> that gets used for some comparison that will happen later - so, for
> example, for a table column, attcollation is setting the default
> collation for comparisons involving that column. The column is not
> itself collated; that doesn't really make sense.
> In the case of partitioning, there is really exactly one thing that
> matters: everybody's got to agree on the collation to be used to
> compare actual or hypothetical values of the partitioning columns
> against the partition bounds. If we've got a set of partition bounds
> b_1, b_2, ..., b_n and a set of partitions p_1, p_2, ..., p_{n-1} such
> that a row in p_i must have a key k such that b_i < k and k < b_{i+1},
> and if the meaning of that "<" operator is collation-dependent, then
> we had better agree on which collation is in use every single time we
> do the test.
> Indexes, of course, have this exact same problem, at least if, like
> btree, they rely on ordering. We search the index by applying the "<"
> operator to compare various values taken from the index tuples with
> the value we're trying to find, and it is absolutely vital that the
> collation used for lookups remain constant and that it is the same as
> the collation used for inserts, which had also better remain constant.
> That is why pg_index has an indcollation column of type oidvector: it
> tells us which collation to use for each index column. It pairs with
> indclass, which tells us which operator class to use for each index
> column. I think that partitioning will need exact analogues -
> partclass and partcollation - because it has exactly the same problem.
> Currently, you've got partclass but not partcollation.
> I'd in general recommend that you try to follow the index precedent
> here as closely as practical.

Thanks for the explanation. I added a new field to the catalog called

That means a couple of things - when comparing input key values (consider
a new tuple being routed) to partition bounds using the btree compare
function, we use the corresponding key column's partcollation. The same
is also used as inputcollid of the OpExpr (or ScalarArrayOpExpr) in the
implicitly generated CHECK constraints.

However, varcollid of any Var nodes in the above expressions is still the
corresponding attributes' attcollation.

Needless to say, a query-specified qualification clause must now include
the same collate clause as used for individual partition columns (or
expressions) for the constraint exclusion to work as intended.

>>> + char relkind = ((CreateStmt *)
>>> stmt)->partby != NULL
>>> Let's push this down into DefineRelation(). i.e. if (stmt->partby !=
>>> NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
>> Done. By the way, I made the ereport say the following: "unexpected
>> relkind value passed to DefineRelation", did you intend it to say
>> something else?
> You don't need it to be translatable because it should only happen if
> there's a bug in the code; users shouldn't actually be able to hit it.
> So use elog() rather than ereport(). You don't need to include the
> function name, either; the user can use \errverbose or \set VERBOSITY
> verbose to get the file and line number if they need it. So just
> elog(ERROR, "unexpected relkind") should be fine; or maybe elog(ERROR,
> "unexpected relkind: %d", (int) relkind), following similar precedent
> elsewhere.

Done as: elog(ERROR, "unexpected relkind: %d", (int) relkind)

> Reading through the latest 0001:
> + The catalog <structname>pg_partitioned_table</structname> stores information
> + about the partition key of tables.
> Maybe "stores information about how tables are partitioned".


> + Partitioning strategy (or method); <literal>l</> = list
> partitioned table,
> I don't think we need "(or method)".


> + <entry><structfield>partexprbin</structfield></entry>
> Is there any good reason not to do partexprbin -> partexpr throughout the patch?

Agreed. Though I used partexprs (like indexprs).

> + A partitioned table is divided into sub-tables (called partitions), which
> + in turn, are created using separate <literal>CREATE TABLE</> commands.
> I would delete "in turn" and the subsequent comma, so that it says
> "which are created using separate".


> + The table itself is empty. A data row inserted into the table is mapped
> + to and stored in one of the partitions (if one exists) based on the
> + values of columns or expressions in the partition key and partition
> + bound constraints associated with the individual partitions.
> How about: "A data row inserted into the table is routed to a
> partition based on the value of columns or expressions in the
> partition key. If no existing partition matches the values in the new
> row, an error will occur."


> + Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or FOREIGN
> + KEY constraints; however, you can define these constraints on individual
> + data partitions.
> Delete "data". Add <literal> tags around the keywords.


> + tuple = SearchSysCache1(PARTEDRELID,
> + ObjectIdGetDatum(RelationGetRelid(rel)));
> + /* Cannot already exist */
> + Assert(!HeapTupleIsValid(tuple));
> This seems pointless. It's hard to see how the tuple could already
> exist, but if it does, this will fail a little later on anyway when we
> do simple_heap_insert() and CatalogUpdateIndexes() for the new tuple.
> In general, if you're doing work only to support an Assert(), you
> should put #ifdef USE_ASSERT_CHECKING around it, but in this case I'd
> just rip this out.

OK, removed the Assert.

> + myself.objectId = RelationGetRelid(rel);;
> Extra semicolon.


> + /*
> + * Store dependencies on anything mentioned in the key expressions.
> + * However, ignore the column references which causes self-dependencies
> + * to be created that are undesirable. That is done by asking the
> + * dependency-tracking sub-system to ignore any such dependencies.
> + */
> I think this comment is spending a lot of time explaining the
> mechanism when what it should be doing is explaining why this case
> arises here and not elsewhere.

Rewrote the comment as follows:

* Anything mentioned in the expressions. We must ignore the column
* references which will count as self-dependency items; in this case,
* the depender is the table itself (there is no such thing as partition
* key object).

> + if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)
> Extra space before &&.


> + if(partattno != 0)
> Missing space.


> + * Identify a btree opclass to use. Currently, we use only btree
> + * operators which seems enough for list and range partitioning.
> Probably best to add a comma before "which".


> break;
> case T_ForeignKeyCacheInfo:
> _outForeignKeyCacheInfo(str, obj);
> + case T_PartitionSpec:
> + _outPartitionSpec(str, obj);
> + break;
> + case T_PartitionElem:
> + _outPartitionElem(str, obj);
> break;
> default:
> Missing break.

Oops, fixed.

> + n->strategy = PARTITION_STRAT_RANGE;
> Let's not abbreviate STRATEGY to STRAT in the names of these constants.


> return "WHEN";
> + return "partition key expression";
> I think you should say "PARTITION BY" here. See the function header
> comment for ParseExprKindName.

but this seems to work better. Also, I renamed EXPR_KIND_PARTITION_KEY to

By the way, a bunch of error messages say "partition key expression". I'm
assuming it's better to leave them alone, that is, not rewrite them as
"PARTITION BY expressions".

> + errmsg("cannot use more than one column in partition key"),
> + errdetail("Only one column allowed with list
> partitioning.")));
> How about combining these two: cannot list partition using more than one column


> + * Note that the partition key data attached to a relcache entry must be
> + * stored CacheMemoryContext to ensure it survives as long as the relcache
> + * entry. But we should be running in a less long-lived working context.
> + * To avoid leaking cache memory if this routine fails partway through,
> + * we build in working memory and then copy the completed structure into
> + * cache memory.
> Again, don't just copy existing comments. Refer to them.

Rewrote the comment to explain two points: why build in the working
context instead of directly in the CacheMemoryContext and why use a
private child context of CacheMemoryContext to store the information.

> + * To retrieve further variable-length attributes, we'd need the catlog's
> Typo.


> + * pg_partitioned_table_fn.h
> + * prototypes for functions in catalog/pg_partitioned_table.c
> Is it really worth having a separate header file for ONE function?

Previously the two functions in this file were defined in catalog/heap.c
and I think they could be moved back there alongside such folks as
AddRelationNewConstraints, StoreAttrDefault, RemoveAttributeById,
RemoveStatistics, etc. So did.

There are no longer pg_partitioned_table.c and pg_partitioned_table_fn.h

> I bet you can avoid making this a keyword. PARTITION BY IDENT in the
> grammar, or something like that.

Managed to do that with one more production as follows:

+part_strategy: IDENT { $$ = $1; }
+ | unreserved_keyword { $$ = pstrdup($1); }
+ ;

And then PARTITION BY part_strategy

That's because "range" is already a keyword.

> +CREATE TABLE pkrel(
> + a int PRIMARY KEY
> +);
> Add a space.


Attached updated patches.


Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-7.patch text/x-diff 112.5 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-7.patch text/x-diff 22.5 KB
0003-Catalog-and-DDL-for-partitions-7.patch text/x-diff 205.0 KB
0004-psql-and-pg_dump-support-for-partitions-7.patch text/x-diff 20.6 KB
0005-Refactor-optimizer-s-inheritance-set-expansion-code-7.patch text/x-diff 16.2 KB
0006-Teach-a-few-places-to-use-partition-check-quals-7.patch text/x-diff 30.2 KB
0007-Introduce-a-PartitionTreeNode-data-structure-7.patch text/x-diff 8.0 KB
0008-Tuple-routing-for-partitioned-tables-7.patch text/x-diff 40.6 KB
0009-Update-DDL-Partitioning-chapter-to-reflect-new-devel-7.patch text/x-diff 24.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2016-10-04 08:15:27 Re: multivariate statistics (v19)
Previous Message Dean Rasheed 2016-10-04 07:49:38 Re: multivariate statistics (v19)