Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-09-30 00:10:40
Message-ID: CA+TgmoaPxXJ14eDVia514UiuQAXyZGqfbz8Qg3G4a8Rz2gKF7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 8:09 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I removed DEPENDENCY_IGNORE. Does the following look good or am I still
> missing something?

You missed your commit message, but otherwise looks fine.

>> Also, I think this should be rephrased a bit to be more clear about
>> how the partitioning key works, like this: The optional
>> <literal>PARTITION BY</literal> clause specifies a method of
>> partitioning the table. The table thus created is called a
>> <firstterm>partitioned</firstterm> table. The parenthesized list of
>> expressions forms the <firsttem>partitioning key</firstterm> for the
>> table. When using range partitioning, the partioning key can include
>> multiple columns or expressions, but for list partitioning, the
>> partitioning key must consist of a single column or expression. If no
>> btree operator class is specified when creating a partitioned table,
>> the default btree operator class for the datatype will be used. If
>> there is none, an error will be reported.
>
> Revised text along these lines.

It seems you copied my typo: my text has "partioning" for
"partitioning" and your patch now has that, too.

> Things were that way initially, that is, the parent relations had no
> relfilenode. I abandoned that project however. The storage-less parent
> thing seemed pretty daunting to me to handle right away. For example,
> optimizer and the executor code needed to be taught about the parent rel
> appendrel member that used to be included as part of the result of
> scanning the inheritance set but was no longer.

Even if we leave the empty relfilenode around for now -- in the long
run I think it should die -- I think we should prohibit the creation
of subsidiary object on the parent which is only sensible if it has
rows - e.g. indexes. It makes no sense to disallow non-inheritable
constraints while allowing indexes, and it could box us into a corner
later.

>> + /*
>> + * Run the expressions through eval_const_expressions. This is
>> + * not just an optimization, but is necessary, because eventually
>> + * the planner will be comparing them to similarly-processed qual
>> + * clauses, and may fail to detect valid matches without this.
>> + * We don't bother with canonicalize_qual, however.
>> + */
>>
>> I'm a bit confused by this, because I would think this processing
>> ought to have been done before storing anything in the system
>> catalogs. I don't see why it should be necessary to do it again after
>> pulling data back out of the system catalogs.
>
> The pattern matches what's done for other expressions that optimizer deals
> with, such as CHECK, index key, and index predicate expressions.

That's kind of a non-answer answer, but OK. Still, I think you
shouldn't just copy somebody else's comment blindly into a new place.
Reference the original comment, or write your own.

>> How can it be valid to have no partitioning expressions?
>
> Keys that are simply column names are resolved to attnums and stored
> likewise. If some key is an expression, then corresponding attnum is 0
> and the expression itself is added to the list that gets stored into
> partexprbin. It is doing the same thing as index expressions.

Oh, right. Oops.

>> + if (classform->relkind != relkind &&
>> + (relkind == RELKIND_RELATION &&
>> + classform->relkind != RELKIND_PARTITIONED_TABLE))
>>
>> That's broken. Note that all of the conditions are joined using &&,
>> so if any one of them fails then we won't throw an error. In
>> particular, it's no longer possible to throw an error when relkind is
>> not RELKIND_RELATION.
>
> You are right. I guess it would have to be the following:
>
> + if ((classform->relkind != relkind &&
> + classform->relkind != RELKIND_PARTITIONED_TABLE) ||
> + (classform->relkind == RELKIND_PARTITIONED_TABLE &&
> + relkind != RELKIND_RELATION))
>
> Such hackishness could not be helped because we have a separate DROP
> command for every distinct relkind, except we overload DROP TABLE for both
> regular and partitioned tables.

Maybe this would be better:

if (relkind == RELKIND_PARTITIONED_TABLE)
syntax_relkind = RELKIND_RELATION;
else
syntax_relkind = rekind;
if (classform->relkind != syntax_relkind)
DropErrorMsgWrongType(rel->relname, classform->relkind, relkind);

>> Why isn't this logic being invoked from transformCreateStmt()? Then
>> we could use the actual parseState for the query instead of a fake
>> one.
>
> Because we need an open relation for it to work, which in this case there
> won't be until after we have performed heap_create_with_catalog() in
> DefineRelation(). Mainly because we need to perform transformExpr() on
> expressions. That's similar to how cookConstraint() on the new CHECK
> constraints cannot be performed earlier. Am I missing something?

Hmm, yeah, I guess that's the same thing. I guess I got on this line
of thinking because the function name started with "transform" which
is usually something happening during parse analysis only. Maybe add
a code comment explaining why the work can't be done sooner, just like
what you've written here.

> Hmm, I guess it wouldn't hurt to just leave any COLLATE clauses as it is -
> just remove the above code. Of course, we must then record the collation
> in the catalog alongside other user-specified information such as operator
> class. Currently, if the key is a simple column we use its attcollation
> and if it's an expression then we use its exprCollation().
>
> When I first wrote it, I wasn't sure what the implications of explicit
> collations would be for partitioning. There are two places where it comes
> into play: a) when comparing partition key values using the btree compare
> function b) embedded as varcollid and inputcollid in implicitly generated
> check constraints for partitions. In case of the latter, any mismatch
> with query-specified collation causes constraint exclusion proof to be
> canceled. When it's default collations everywhere, the chances of that
> sort of thing happening are less. If we support user-specified collations
> on keys, then things will get a little bit more involved.

I think it's worth rewinding to general principles here: the only time
a collation has real meaning is in the context of a comparison
operation. If we ask whether A < B, the answer may depend on the
collation that is used to perform the comparison. Any other place
that mentions a collation is only trying to influence the collation
that gets used for some comparison that will happen later - so, for
example, for a table column, attcollation is setting the default
collation for comparisons involving that column. The column is not
itself collated; that doesn't really make sense.

In the case of partitioning, there is really exactly one thing that
matters: everybody's got to agree on the collation to be used to
compare actual or hypothetical values of the partitioning columns
against the partition bounds. If we've got a set of partition bounds
b_1, b_2, ..., b_n and a set of partitions p_1, p_2, ..., p_{n-1} such
that a row in p_i must have a key k such that b_i < k and k < b_{i+1},
and if the meaning of that "<" operator is collation-dependent, then
we had better agree on which collation is in use every single time we
do the test.

Indexes, of course, have this exact same problem, at least if, like
btree, they rely on ordering. We search the index by applying the "<"
operator to compare various values taken from the index tuples with
the value we're trying to find, and it is absolutely vital that the
collation used for lookups remain constant and that it is the same as
the collation used for inserts, which had also better remain constant.
That is why pg_index has an indcollation column of type oidvector: it
tells us which collation to use for each index column. It pairs with
indclass, which tells us which operator class to use for each index
column. I think that partitioning will need exact analogues -
partclass and partcollation - because it has exactly the same problem.
Currently, you've got partclass but not partcollation.

I'd in general recommend that you try to follow the index precedent
here as closely as practical.

> However in the present case, this is just one side of a whole partition
> constraint (the other piece being individual partition's bound value), so
> should be treated a bit differently from the CHECK constraints. I modeled
> this on ComputeIndexAttrs() checks viz. the following:
>
> /*
> * An expression using mutable functions is probably wrong,
> * since if you aren't going to get the same result for the
> * same data every time, it's not clear what the index entries
> * mean at all.
> */
> if (CheckMutability((Expr *) expr))
> ereport(ERROR,
> (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> errmsg("functions in index expression must be marked IMMUTABLE")));
>
> Likewise if a partition key expression contained mutable functions, same
> input row could be mapped to different partitions based on the result of
> expression computed using the input values. So, it seems prudent to
> enforce immutability unlike CHECK constraints. Am I missing something?

OK, seems reasonable.

>> + char relkind = ((CreateStmt *)
>> stmt)->partby != NULL
>> + ? RELKIND_PARTITIONED_TABLE
>> + : RELKIND_RELATION;
>>
>> Let's push this down into DefineRelation(). i.e. if (stmt->partby !=
>> NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
>> RELKIND_PARTITION_TABLE; }
>
> Done. By the way, I made the ereport say the following: "unexpected
> relkind value passed to DefineRelation", did you intend it to say
> something else?

You don't need it to be translatable because it should only happen if
there's a bug in the code; users shouldn't actually be able to hit it.
So use elog() rather than ereport(). You don't need to include the
function name, either; the user can use \errverbose or \set VERBOSITY
verbose to get the file and line number if they need it. So just
elog(ERROR, "unexpected relkind") should be fine; or maybe elog(ERROR,
"unexpected relkind: %d", (int) relkind), following similar precedent
elsewhere.

Reading through the latest 0001:

+ The catalog <structname>pg_partitioned_table</structname> stores information
+ about the partition key of tables.

Maybe "stores information about how tables are partitioned".

+ Partitioning strategy (or method); <literal>l</> = list
partitioned table,

I don't think we need "(or method)".

+ <entry><structfield>partexprbin</structfield></entry>

Is there any good reason not to do partexprbin -> partexpr throughout the patch?

+ A partitioned table is divided into sub-tables (called partitions), which
+ in turn, are created using separate <literal>CREATE TABLE</> commands.

I would delete "in turn" and the subsequent comma, so that it says
"which are created using separate".

+ The table itself is empty. A data row inserted into the table is mapped
+ to and stored in one of the partitions (if one exists) based on the
+ values of columns or expressions in the partition key and partition
+ bound constraints associated with the individual partitions.

How about: "A data row inserted into the table is routed to a
partition based on the value of columns or expressions in the
partition key. If no existing partition matches the values in the new
row, an error will occur."

+ Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or FOREIGN
+ KEY constraints; however, you can define these constraints on individual
+ data partitions.

Delete "data". Add <literal> tags around the keywords.

+ tuple = SearchSysCache1(PARTEDRELID,
+ ObjectIdGetDatum(RelationGetRelid(rel)));
+ /* Cannot already exist */
+ Assert(!HeapTupleIsValid(tuple));

This seems pointless. It's hard to see how the tuple could already
exist, but if it does, this will fail a little later on anyway when we
do simple_heap_insert() and CatalogUpdateIndexes() for the new tuple.
In general, if you're doing work only to support an Assert(), you
should put #ifdef USE_ASSERT_CHECKING around it, but in this case I'd
just rip this out.

+ myself.objectId = RelationGetRelid(rel);;

Extra semicolon.

+ /*
+ * Store dependencies on anything mentioned in the key expressions.
+ * However, ignore the column references which causes self-dependencies
+ * to be created that are undesirable. That is done by asking the
+ * dependency-tracking sub-system to ignore any such dependencies.
+ */

I think this comment is spending a lot of time explaining the
mechanism when what it should be doing is explaining why this case
arises here and not elsewhere.

+ if (relkind != RELKIND_RELATION && relkind != RELKIND_PARTITIONED_TABLE)

Extra space before &&.

+ if(partattno != 0)

Missing space.

+ * Identify a btree opclass to use. Currently, we use only btree
+ * operators which seems enough for list and range partitioning.

Probably best to add a comma before "which".

break;
case T_ForeignKeyCacheInfo:
_outForeignKeyCacheInfo(str, obj);
+ case T_PartitionSpec:
+ _outPartitionSpec(str, obj);
+ break;
+ case T_PartitionElem:
+ _outPartitionElem(str, obj);
break;

default:

Missing break.

+ n->strategy = PARTITION_STRAT_RANGE;

Let's not abbreviate STRATEGY to STRAT in the names of these constants.

case EXPR_KIND_TRIGGER_WHEN:
return "WHEN";
+ case EXPR_KIND_PARTITION_KEY:
+ return "partition key expression";

I think you should say "PARTITION BY" here. See the function header
comment for ParseExprKindName.

+ errmsg("cannot use more than one column in partition key"),
+ errdetail("Only one column allowed with list
partitioning.")));

How about combining these two: cannot list partition using more than one column

+ * Note that the partition key data attached to a relcache entry must be
+ * stored CacheMemoryContext to ensure it survives as long as the relcache
+ * entry. But we should be running in a less long-lived working context.
+ * To avoid leaking cache memory if this routine fails partway through,
+ * we build in working memory and then copy the completed structure into
+ * cache memory.

Again, don't just copy existing comments. Refer to them.

+ * To retrieve further variable-length attributes, we'd need the catlog's

Typo.

+ * pg_partitioned_table_fn.h
+ * prototypes for functions in catalog/pg_partitioned_table.c

Is it really worth having a separate header file for ONE function?

+PG_KEYWORD("list", LIST, UNRESERVED_KEYWORD)

I bet you can avoid making this a keyword. PARTITION BY IDENT in the
grammar, or something like that.

+CREATE TABLE pkrel(
+ a int PRIMARY KEY
+);

Add a space.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-09-30 00:14:40 Re: Hash Indexes
Previous Message Peter Geoghegan 2016-09-30 00:07:42 Re: Hash Indexes