Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-09-20 13:41:06
Message-ID: CA+Tgmobs7C+4t5xm9LGopUb=2HF+SBE2EQ2NRH5EMd+drs6OUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Sep 15, 2016 at 4:53 AM, Amit Langote
<Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> [ new patches ]

Re-reviewing 0001.

+ <entry><structfield>partexprs</structfield></entry>
+ <entry><type>pg_node_tree</type></entry>

This documentation doesn't match pg_partition_table.h, which has
partexprsrc and partexprbin. I don't understand why it's a good idea
to have both, and there seem to be no comments or documentation
supporting that choice anywhere.

+ The optional <literal>PARTITION BY</> clause specifies a method of
+ partitioning the table and the corresponding partition key. Table
+ thus created is called <firstterm>partitioned</firstterm> table. Key
+ consists of an ordered list of column names and/or expressions when
+ using the <literal>RANGE</> method, whereas only a single column or
+ expression can be specified when using the <literal>LIST</> method.
+ The type of a key column or an expression must have an associated
+ btree operator class or one must be specified along with the column
+ or the expression.

Both of the sentences in this paragraph that do not begin with "the"
need to begin with "the". (In my experience, it's generally a feature
of English as spoken in India that connecting words like "the" and "a"
are sometimes left out where non-Indian speakers of English would
include them, so it would be good to watch out for this issue in
general.)

Also, I think this should be rephrased a bit to be more clear about
how the partitioning key works, like this: The optional
<literal>PARTITION BY</literal> clause specifies a method of
partitioning the table. The table thus created is called a
<firstterm>partitioned</firstterm> table. The parenthesized list of
expressions forms the <firsttem>partitioning key</firstterm> for the
table. When using range partitioning, the partioning key can include
multiple columns or expressions, but for list partitioning, the
partitioning key must consist of a single column or expression. If no
btree operator class is specified when creating a partitioned table,
the default btree operator class for the datatype will be used. If
there is none, an error will be reported.

+ case RELKIND_PARTITIONED_TABLE:
options = heap_reloptions(classForm->relkind, datum, false);

Why? None of the reloptions that pertain to heap seem relevant to a
relkind without storage.

But, ah, do partitioned tables have storage? I mean, do we end up
with an empty file, or no relfilenode at all? Can I CLUSTER, VACUUM,
etc. a partitioned table? It would seem cleaner for the parent to
have no relfilenode at all, but I'm guessing we might need some more
changes for that to work out.

+ pg_collation.h pg_range.h pg_transform.h pg_partitioned_table.h\

Whitespace. Also, here and elsewhere, how about using alphabetical
order, or anyway preserving it insofar as the existing list is
alphabetized?

+ /* Remove NO INHERIT flag if rel is a partitioned table */
+ if (is_no_inherit &&
+ rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("cannot add NO INHERIT constraint to
partitioned table \"%s\"",
+ RelationGetRelationName(rel))));

The code and the comment disagree. I think the code is right and the
comment should be adjusted to say something like /* Partitioned tables
do not have storage, so a NO INHERIT constraint makes no sense. */

+ * IDENTIFICATION
+ * src/backend/utils/misc/partition.c

Wrong.

+} KeyTypeCollInfo;

I don't like this type name very much. Can we get "Part" in there someplace?

It doesn't seem to be very well-designed, either. The number of
entries in each array is determined by the partnatts flag in
PartitionKeyData, which has also got various other arrays whose
lengths are determined by partnatts. Why do we have some arrays in
one structure and some arrays in another structure? Would it hurt
anything to merge everything into one structure? Or could
PartitionKeyData include a field of type KeyTypeCollInfo rather than
KeyTypeCollInfo *, saving one pointer reference every place we access
this data?

+ /* Allocate in the supposedly short-lived working context */

Why supposedly?

+ datum = fastgetattr(tuple, Anum_pg_partitioned_table_partattrs,
+ RelationGetDescr(catalog),
+ &isnull);

Isn't the point of putting the fixed-length fields first that we can
use GETSTRUCT() here? And even for partattrs as the first
variable-length thing?

+ /*
+ * Run the expressions through eval_const_expressions. This is
+ * not just an optimization, but is necessary, because eventually
+ * the planner will be comparing them to similarly-processed qual
+ * clauses, and may fail to detect valid matches without this.
+ * We don't bother with canonicalize_qual, however.
+ */

I'm a bit confused by this, because I would think this processing
ought to have been done before storing anything in the system
catalogs. I don't see why it should be necessary to do it again after
pulling data back out of the system catalogs.

+ Value *str = lfirst(partexprsrc_item);
+ key->partcolnames[i] = pstrdup(str->val.str);

Should have a blank line in between.

+/*
+ * Partition key information inquiry functions
+ */
+int
+get_partition_key_strategy(PartitionKey key)
+{
+ return key->strategy;
+}
+
+int
+get_partition_key_natts(PartitionKey key)
+{
+ return key->partnatts;
+}
+
+List *
+get_partition_key_exprs(PartitionKey key)
+{
+ return key->partexprs;
+}
+
+/*
+ * Partition key information inquiry functions - one column
+ */
+int16
+get_partition_col_attnum(PartitionKey key, int col)
+{
+ return key->partattrs[col];
+}
+
+Oid
+get_partition_col_typid(PartitionKey key, int col)
+{
+ return key->tcinfo->typid[col];
+}
+
+int32
+get_partition_col_typmod(PartitionKey key, int col)
+{
+ return key->tcinfo->typmod[col];
+}

If we're going to add notation for this, I think we should use macros
(or static inline functions defined in the header file). Doing it
this way adds more cycles for no benefit.

+ newkey->partattrs = (AttrNumber *)
+ palloc0(newkey->partnatts * sizeof(AttrNumber));
+ memcpy(newkey->partattrs, fromkey->partattrs,
+ newkey->partnatts * sizeof(AttrNumber));

It's wasteful to use palloc0 if you're immediately going to overwrite
every byte in the array. Use regular palloc instead.

+ * Copy the partition key, opclass info into arrays (should we
+ * make the caller pass them like this to start with?)

Only if it happens to be convenient for the caller, which doesn't seem
to be the case here.

+ /* Only this can ever be NULL */
+ if (!partexprbinDatum)
+ {
+ nulls[Anum_pg_partitioned_table_partexprbin - 1] = true;
+ nulls[Anum_pg_partitioned_table_partexprsrc - 1] = true;
+ }

How can it be valid to have no partitioning expressions?

+ /* Tell world about the key */
+ CacheInvalidateRelcache(rel);

Is this really needed? Isn't the caller going to do something similar
pretty soon?

+ heap_freetuple(tuple);

Probably useless - might as well let the context reset clean it up.

+ simple_heap_delete(rel, &tuple->t_self);
+
+ /* Update the indexes on pg_partitioned_table */
+ CatalogUpdateIndexes(rel, tuple);

You don't need CatalogUpdateIndexes() after a delete, only after an
insert or update.

+ if (classform->relkind != relkind &&
+ (relkind == RELKIND_RELATION &&
+ classform->relkind != RELKIND_PARTITIONED_TABLE))

That's broken. Note that all of the conditions are joined using &&,
so if any one of them fails then we won't throw an error. In
particular, it's no longer possible to throw an error when relkind is
not RELKIND_RELATION.

+/* Checks if a Var node is for a given attnum */
+static bool
+find_attr_reference_walker(Node *node, find_attr_reference_context *context)
+{
+ if (node == NULL)
+ return false;
+
+ if (IsA(node, Var))
+ {
+ Var *var = (Var *) node;
+ char *varattname = get_attname(context->relid, var->varattno);
+
+ if (!strcmp(varattname, context->attname))
+ return true;
+ }
+
+ return expression_tree_walker(node, find_attr_reference_walker, context);
+}

Hrm. The comment says we're matching on attnum, but the code says
we're matching on attname. is_partition_attr() has the same confusion
between comments and code. Maybe instead of this whole approach it
would be better to use pull_varattnos(), then get_attnum() to find the
attribute number for the one you want, then bms_is_member().

+static PartitionBy *
+transformPartitionBy(Relation rel, PartitionBy *partitionby)
+{
+ PartitionBy *partby;
+ ParseState *pstate;
+ RangeTblEntry *rte;
+ ListCell *l;
+
+ partby = (PartitionBy *) makeNode(PartitionBy);
+
+ partby->strategy = partitionby->strategy;
+ partby->location = partitionby->location;
+ partby->partParams = NIL;
+
+ /*
+ * Create a dummy ParseState and insert the target relation as its sole
+ * rangetable entry. We need a ParseState for transformExpr.
+ */
+ pstate = make_parsestate(NULL);

Why isn't this logic being invoked from transformCreateStmt()? Then
we could use the actual parseState for the query instead of a fake
one.

+ if (IsA(expr, CollateExpr))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("cannot use COLLATE in partition key
expression")));

I assume there is a good reason for this seemingly-arbitrary
restriction, but there's no comment saying what it is. One thing
that's odd is that this will only prohibit a CollateExpr at the top
level, not in some more-deeply nested position. That seems
suspicious.

+ /*
+ * User wrote "(column)" or "(column COLLATE something)".
+ * Treat it like simple attribute anyway.
+ */

Evidently, the user did not do that, because you just prohibited the
second one of those.

+ if (IsA(expr, Const))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("cannot use a constant expression
as partition key")));
+
+ if (contain_mutable_functions(expr))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("functions in partition key
expression must be marked IMMUTABLE")));

Do these checks parallel what we do for CHECK constraints? It might
be good to apply about the same level of rigor in both cases.

+ exprsrc = deparse_expression(expr,
+ deparse_context_for(RelationGetRelationName(rel),
+ RelationGetRelid(rel)),
+ false, false);

Why bother? The output of this doesn't seem like a useful thing to
store. The fact that we've done similar things elsewhere doesn't make
it a good idea. I think we did it in other cases because we used to
be dumber than we are now.

+ (errcode(ERRCODE_UNDEFINED_OBJECT),
+ errmsg("data type %s has no default btree
operator class",
+ format_type_be(atttype)),
+ errhint("You must specify an existing btree
operator class or define one for the type.")));

The hint is not really accurate, because the type may well have a
btree operator class. Just not a default one.

+ char relkind = ((CreateStmt *)
stmt)->partby != NULL
+ ? RELKIND_PARTITIONED_TABLE
+ : RELKIND_RELATION;

Let's push this down into DefineRelation(). i.e. if (stmt->partby !=
NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
RELKIND_PARTITION_TABLE; }

+ RelationBuildPartitionKey(relation);

I wonder if RelationBuildPartitionKey should really be in relcache.c.
What do we do in similar cases?

+} PartitionBy;

Maybe PartitionSpec?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stas Kelvich 2016-09-20 14:13:19 Re: Speedup twophase transactions
Previous Message Amit Kapila 2016-09-20 13:24:03 Re: Parallel sec scan in plpgsql