Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-09-29 12:09:08
Message-ID: 7aa6a14d-9476-afd7-2c09-6a016eb8c12b@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers


Thanks a lot for the review and sorry about the delay in replying.
Combining responses to two emails.

On 2016/09/20 5:06, Robert Haas wrote:
> On Mon, Aug 15, 2016 at 7:21 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> + if (partexprs)
>>> + recordDependencyOnSingleRelExpr(&myself,
>>> + (Node *) partexprs,
>>> + RelationGetRelid(rel),
>>> + DEPENDENCY_NORMAL,
>>> + DEPENDENCY_IGNORE);
>>>
>>> I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
>>> here. Instead, you could just add an additional Boolean argument to
>>> recordDependencyOnSingleRelExpr. That seems less likely to create
>>> bugs in unrelated portions of the code.
>>
>> I did consider a Boolean argument instead of a new DependencyType value,
>> however it felt a bit strange to pass a valid value for the fifth argument
>> (self_behavior) and then ask using a separate parameter that it (a
>> self-dependency) is to be ignored. By the way, no pg_depend entry is
>> created on such a call, so the effect of the new type's usage seems
>> localized to me. Thoughts?
>
> I think that's not a very plausible argument. If you add a fifth
> argument to that function, then only that function needs to know about
> the possibility of ignoring self-dependencies. If you add a new
> dependency type, then everything that knows about DependencyType needs
> to know about them. That's got to be a much larger surface area for
> bugs. Also, if you look around a bit, I believe you will find other
> examples of cases where one argument is used only for certain values
> of some other argument. That's not a novel design pattern.

I removed DEPENDENCY_IGNORE. Does the following look good or am I still
missing something?

@@ -83,7 +78,6 @@ typedef enum DependencyType
DEPENDENCY_EXTENSION = 'e',
DEPENDENCY_AUTO_EXTENSION = 'x',
DEPENDENCY_PIN = 'p',
- DEPENDENCY_IGNORE = 'g'

@@ -194,7 +188,8 @@ extern void recordDependencyOnExpr(const ObjectAddress
*depender,
extern void recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
Node *expr, Oid relId,
DependencyType behavior,
- DependencyType self_behavior);
+ DependencyType self_behavior,
+ bool ignore_self);

@@ -1450,9 +1449,10 @@ recordDependencyOnSingleRelExpr(const ObjectAddress
*depender,
context.addrs->numrefs = outrefs;

/* Record the self-dependencies */
- recordMultipleDependencies(depender,
- self_addrs->refs, self_addrs->numrefs,
- self_behavior);
+ if (!ignore_self)
+ recordMultipleDependencies(depender,
+ self_addrs->refs, self_addrs->numrefs,
+ self_behavior);

@@ -138,7 +138,7 @@ StorePartitionKey(Relation rel,
(Node *) partexprbin,
RelationGetRelid(rel),
DEPENDENCY_NORMAL,
- DEPENDENCY_IGNORE);
+ DEPENDENCY_NORMAL, true);

> Re-reviewing 0001.
>
> + <entry><structfield>partexprs</structfield></entry>
> + <entry><type>pg_node_tree</type></entry>
>
> This documentation doesn't match pg_partition_table.h, which has
> partexprsrc and partexprbin. I don't understand why it's a good idea
> to have both, and there seem to be no comments or documentation
> supporting that choice anywhere.

Removed partexprsrc, I realized the design whereby it was required to be
stored in the catalog was a dubious one after all.

> + The optional <literal>PARTITION BY</> clause specifies a method of
> + partitioning the table and the corresponding partition key. Table
> + thus created is called <firstterm>partitioned</firstterm> table. Key
> + consists of an ordered list of column names and/or expressions when
> + using the <literal>RANGE</> method, whereas only a single column or
> + expression can be specified when using the <literal>LIST</> method.
> + The type of a key column or an expression must have an associated
> + btree operator class or one must be specified along with the column
> + or the expression.
>
> Both of the sentences in this paragraph that do not begin with "the"
> need to begin with "the". (In my experience, it's generally a feature
> of English as spoken in India that connecting words like "the" and "a"
> are sometimes left out where non-Indian speakers of English would
> include them, so it would be good to watch out for this issue in
> general.)

Thanks for the tip, will try to be careful (rules about a/an/the's can be
too subtle for me sometimes, so any help is much appreciated).

> Also, I think this should be rephrased a bit to be more clear about
> how the partitioning key works, like this: The optional
> <literal>PARTITION BY</literal> clause specifies a method of
> partitioning the table. The table thus created is called a
> <firstterm>partitioned</firstterm> table. The parenthesized list of
> expressions forms the <firsttem>partitioning key</firstterm> for the
> table. When using range partitioning, the partioning key can include
> multiple columns or expressions, but for list partitioning, the
> partitioning key must consist of a single column or expression. If no
> btree operator class is specified when creating a partitioned table,
> the default btree operator class for the datatype will be used. If
> there is none, an error will be reported.

Revised text along these lines.

> + case RELKIND_PARTITIONED_TABLE:
> options = heap_reloptions(classForm->relkind, datum, false);
>
> Why? None of the reloptions that pertain to heap seem relevant to a
> relkind without storage.
>
> But, ah, do partitioned tables have storage? I mean, do we end up
> with an empty file, or no relfilenode at all? Can I CLUSTER, VACUUM,
> etc. a partitioned table? It would seem cleaner for the parent to
> have no relfilenode at all, but I'm guessing we might need some more
> changes for that to work out.

Things were that way initially, that is, the parent relations had no
relfilenode. I abandoned that project however. The storage-less parent
thing seemed pretty daunting to me to handle right away. For example,
optimizer and the executor code needed to be taught about the parent rel
appendrel member that used to be included as part of the result of
scanning the inheritance set but was no longer.

That said, there are still diffs in the patch resulting from simply
informing various sites in the code that there is a new relkind
RELKIND_PARTITIONED_TABLE, most of the sites giving it the same treatment
as RELKIND_RELATION. Except maybe that the new relkind serves as a
convenient shorthand for specifying to various modules that the relation
is a partitioned table that is in some regards to be treated differently
from the ordinary tables.

> + pg_collation.h pg_range.h pg_transform.h pg_partitioned_table.h\
>
> Whitespace. Also, here and elsewhere, how about using alphabetical
> order, or anyway preserving it insofar as the existing list is
> alphabetized?

Fixed.

> + /* Remove NO INHERIT flag if rel is a partitioned table */
> + if (is_no_inherit &&
> + rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> + errmsg("cannot add NO INHERIT constraint to
> partitioned table \"%s\"",
> + RelationGetRelationName(rel))));
>
> The code and the comment disagree. I think the code is right and the
> comment should be adjusted to say something like /* Partitioned tables
> do not have storage, so a NO INHERIT constraint makes no sense. */

Rewrote the comment.

>
> + * IDENTIFICATION
> + * src/backend/utils/misc/partition.c
>
> Wrong.

This file is no longer part of the 0001 patch since I moved the file's
only function to relcache.c as mentioned below. Fixed in the later patch
nonetheless.

>
> +} KeyTypeCollInfo;
>
> I don't like this type name very much. Can we get "Part" in there someplace?
>
> It doesn't seem to be very well-designed, either. The number of
> entries in each array is determined by the partnatts flag in
> PartitionKeyData, which has also got various other arrays whose
> lengths are determined by partnatts. Why do we have some arrays in
> one structure and some arrays in another structure? Would it hurt
> anything to merge everything into one structure? Or could
> PartitionKeyData include a field of type KeyTypeCollInfo rather than
> KeyTypeCollInfo *, saving one pointer reference every place we access
> this data?

I followed your advice to just move the typ* arrays into the
PartitionKeyData struct.

> + /* Allocate in the supposedly short-lived working context */
>
> Why supposedly?

When writing the code, I was thinking of the following header comment that
I copied from somewhere:

* Note that the partition key data attached to a relcache entry must be
* stored CacheMemoryContext to ensure it survives as long as the relcache
* entry. But we should be running in a less long-lived working context.
* To avoid leaking cache memory if this routine fails partway through,
* we build in working memory and then copy the completed structure into
* cache memory.

Removed the comment as it is not very illuminating anymore.

> + datum = fastgetattr(tuple, Anum_pg_partitioned_table_partattrs,
> + RelationGetDescr(catalog),
> + &isnull);
>
> Isn't the point of putting the fixed-length fields first that we can
> use GETSTRUCT() here? And even for partattrs as the first
> variable-length thing?

Right, fixed.

>
> + /*
> + * Run the expressions through eval_const_expressions. This is
> + * not just an optimization, but is necessary, because eventually
> + * the planner will be comparing them to similarly-processed qual
> + * clauses, and may fail to detect valid matches without this.
> + * We don't bother with canonicalize_qual, however.
> + */
>
> I'm a bit confused by this, because I would think this processing
> ought to have been done before storing anything in the system
> catalogs. I don't see why it should be necessary to do it again after
> pulling data back out of the system catalogs.

The pattern matches what's done for other expressions that optimizer deals
with, such as CHECK, index key, and index predicate expressions.

> + Value *str = lfirst(partexprsrc_item);
> + key->partcolnames[i] = pstrdup(str->val.str);
>
> Should have a blank line in between.

Fixed.

> +/*
> + * Partition key information inquiry functions
> + */
> +int
> +get_partition_key_strategy(PartitionKey key)
> +{
> + return key->strategy;
> +}
> +
> [ ... ]
> +
> +int32
> +get_partition_col_typmod(PartitionKey key, int col)
> +{
> + return key->tcinfo->typmod[col];
> +}
>
> If we're going to add notation for this, I think we should use macros
> (or static inline functions defined in the header file). Doing it
> this way adds more cycles for no benefit.

OK, done using the static inline functions defined in the header file way.

> + newkey->partattrs = (AttrNumber *)
> + palloc0(newkey->partnatts * sizeof(AttrNumber));
> + memcpy(newkey->partattrs, fromkey->partattrs,
> + newkey->partnatts * sizeof(AttrNumber));
>
> It's wasteful to use palloc0 if you're immediately going to overwrite
> every byte in the array. Use regular palloc instead.

Done.

> + /* Only this can ever be NULL */
> + if (!partexprbinDatum)
> + {
> + nulls[Anum_pg_partitioned_table_partexprbin - 1] = true;
> + nulls[Anum_pg_partitioned_table_partexprsrc - 1] = true;
> + }
>
> How can it be valid to have no partitioning expressions?

Keys that are simply column names are resolved to attnums and stored
likewise. If some key is an expression, then corresponding attnum is 0
and the expression itself is added to the list that gets stored into
partexprbin. It is doing the same thing as index expressions.

> + /* Tell world about the key */
> + CacheInvalidateRelcache(rel);
>
> Is this really needed? Isn't the caller going to do something similar
> pretty soon?

Oops, I removed that in one of the later patches but shouldn't be there in
the first place. Fixed.

> + heap_freetuple(tuple);
>
> Probably useless - might as well let the context reset clean it up.

Removed.

> + simple_heap_delete(rel, &tuple->t_self);
> +
> + /* Update the indexes on pg_partitioned_table */
> + CatalogUpdateIndexes(rel, tuple);
>
> You don't need CatalogUpdateIndexes() after a delete, only after an
> insert or update.

Ah right, removed.

>
> + if (classform->relkind != relkind &&
> + (relkind == RELKIND_RELATION &&
> + classform->relkind != RELKIND_PARTITIONED_TABLE))
>
> That's broken. Note that all of the conditions are joined using &&,
> so if any one of them fails then we won't throw an error. In
> particular, it's no longer possible to throw an error when relkind is
> not RELKIND_RELATION.

You are right. I guess it would have to be the following:

+ if ((classform->relkind != relkind &&
+ classform->relkind != RELKIND_PARTITIONED_TABLE) ||
+ (classform->relkind == RELKIND_PARTITIONED_TABLE &&
+ relkind != RELKIND_RELATION))

Such hackishness could not be helped because we have a separate DROP
command for every distinct relkind, except we overload DROP TABLE for both
regular and partitioned tables.

> +/* Checks if a Var node is for a given attnum */
> +static bool
> +find_attr_reference_walker(Node *node, find_attr_reference_context *context)
> +{
> + if (node == NULL)
> + return false;
> +
> + if (IsA(node, Var))
> + {
> + Var *var = (Var *) node;
> + char *varattname = get_attname(context->relid, var->varattno);
> +
> + if (!strcmp(varattname, context->attname))
> + return true;
> + }
> +
> + return expression_tree_walker(node, find_attr_reference_walker, context);
> +}
>
> Hrm. The comment says we're matching on attnum, but the code says
> we're matching on attname. is_partition_attr() has the same confusion
> between comments and code. Maybe instead of this whole approach it
> would be better to use pull_varattnos(), then get_attnum() to find the
> attribute number for the one you want, then bms_is_member().

I like the idea to use pull_varattnos(), so done that way.

> +static PartitionBy *
> +transformPartitionBy(Relation rel, PartitionBy *partitionby)
> +{
> + PartitionBy *partby;
> + ParseState *pstate;
> + RangeTblEntry *rte;
> + ListCell *l;
> +
> + partby = (PartitionBy *) makeNode(PartitionBy);
> +
> + partby->strategy = partitionby->strategy;
> + partby->location = partitionby->location;
> + partby->partParams = NIL;
> +
> + /*
> + * Create a dummy ParseState and insert the target relation as its sole
> + * rangetable entry. We need a ParseState for transformExpr.
> + */
> + pstate = make_parsestate(NULL);
>
> Why isn't this logic being invoked from transformCreateStmt()? Then
> we could use the actual parseState for the query instead of a fake
> one.

Because we need an open relation for it to work, which in this case there
won't be until after we have performed heap_create_with_catalog() in
DefineRelation(). Mainly because we need to perform transformExpr() on
expressions. That's similar to how cookConstraint() on the new CHECK
constraints cannot be performed earlier. Am I missing something?

> + if (IsA(expr, CollateExpr))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> + errmsg("cannot use COLLATE in partition key
> expression")));
>
> I assume there is a good reason for this seemingly-arbitrary
> restriction, but there's no comment saying what it is. One thing
> that's odd is that this will only prohibit a CollateExpr at the top
> level, not in some more-deeply nested position. That seems
> suspicious.

Hmm, I guess it wouldn't hurt to just leave any COLLATE clauses as it is -
just remove the above code. Of course, we must then record the collation
in the catalog alongside other user-specified information such as operator
class. Currently, if the key is a simple column we use its attcollation
and if it's an expression then we use its exprCollation().

When I first wrote it, I wasn't sure what the implications of explicit
collations would be for partitioning. There are two places where it comes
into play: a) when comparing partition key values using the btree compare
function b) embedded as varcollid and inputcollid in implicitly generated
check constraints for partitions. In case of the latter, any mismatch
with query-specified collation causes constraint exclusion proof to be
canceled. When it's default collations everywhere, the chances of that
sort of thing happening are less. If we support user-specified collations
on keys, then things will get a little bit more involved.

Thoughts?

> + /*
> + * User wrote "(column)" or "(column COLLATE something)".
> + * Treat it like simple attribute anyway.
> + */
>
> Evidently, the user did not do that, because you just prohibited the
> second one of those.

Holds true now that I have removed the prohibition.

> + if (IsA(expr, Const))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> + errmsg("cannot use a constant expression
> as partition key")));
> +
> + if (contain_mutable_functions(expr))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> + errmsg("functions in partition key
> expression must be marked IMMUTABLE")));
>
> Do these checks parallel what we do for CHECK constraints? It might
> be good to apply about the same level of rigor in both cases.

Both of these checks are not done for CHECK constraints. The only check
performed on them in cookConstraint() checks whether the expression is of
boolean type.

However in the present case, this is just one side of a whole partition
constraint (the other piece being individual partition's bound value), so
should be treated a bit differently from the CHECK constraints. I modeled
this on ComputeIndexAttrs() checks viz. the following:

/*
* An expression using mutable functions is probably wrong,
* since if you aren't going to get the same result for the
* same data every time, it's not clear what the index entries
* mean at all.
*/
if (CheckMutability((Expr *) expr))
ereport(ERROR,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("functions in index expression must be marked IMMUTABLE")));

Likewise if a partition key expression contained mutable functions, same
input row could be mapped to different partitions based on the result of
expression computed using the input values. So, it seems prudent to
enforce immutability unlike CHECK constraints. Am I missing something?

>
> + exprsrc = deparse_expression(expr,
> + deparse_context_for(RelationGetRelationName(rel),
> + RelationGetRelid(rel)),
> + false, false);
>
> Why bother? The output of this doesn't seem like a useful thing to
> store. The fact that we've done similar things elsewhere doesn't make
> it a good idea. I think we did it in other cases because we used to
> be dumber than we are now.

Deparsed expressions are no longer stored in the catalog.

> + (errcode(ERRCODE_UNDEFINED_OBJECT),
> + errmsg("data type %s has no default btree
> operator class",
> + format_type_be(atttype)),
> + errhint("You must specify an existing btree
> operator class or define one for the type.")));
>
> The hint is not really accurate, because the type may well have a
> btree operator class. Just not a default one.

Changed to: "You must specify a btree operator class or define a default
btree operator class for the data type."

> + char relkind = ((CreateStmt *)
> stmt)->partby != NULL
> + ? RELKIND_PARTITIONED_TABLE
> + : RELKIND_RELATION;
>
> Let's push this down into DefineRelation(). i.e. if (stmt->partby !=
> NULL) { if (relkind != RELKIND_RELATION) ereport(...); relkind =
> RELKIND_PARTITION_TABLE; }

Done. By the way, I made the ereport say the following: "unexpected
relkind value passed to DefineRelation", did you intend it to say
something else?

>
> + RelationBuildPartitionKey(relation);
>
> I wonder if RelationBuildPartitionKey should really be in relcache.c.
> What do we do in similar cases?

There a number of RelationBuild* functions that RelationBuildDesc calls
that are reside outside relcache.c in respective modules - trigger.c
(BuildTrigger), policy.c (BuildRowSecurity), whereas some others such as
RelationInitIndexAccessInfo() are housed in relcache.c.

BuildPartitionKey used to be in relcache.c and can be moved back there. So
did.

>
> +} PartitionBy;
>
> Maybe PartitionSpec?

A later patch uses PartitionListSpec and PartitionRangeSpec as parse nodes
for partition bounds. How about call PartitionBy PartitionKeySpec and the
bound nodes just mentioned PartitionBoundList and PartitionBoundRange,
respectively?

Attached revised patches. In addition to addressing comments in this
email (which addressed only the patch 0001), number of other fixes [1] and
other miscellaneous improvements [2] have been included.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/2896956a-8af0-3f91-a3bc-1e5225ae7ed2@lab.ntt.co.jp

[2]
https://www.postgresql.org/message-id/98d6c7e7-4919-4970-a158-49f740101812%40lab.ntt.co.jp

Attachment Content-Type Size
0001-Catalog-and-DDL-for-partitioned-tables-6.patch text/x-diff 111.9 KB
0002-psql-and-pg_dump-support-for-partitioned-tables-6.patch text/x-diff 22.5 KB
0003-Catalog-and-DDL-for-partitions-6.patch text/x-diff 205.8 KB
0004-psql-and-pg_dump-support-for-partitions-6.patch text/x-diff 20.6 KB
0005-Refactor-optimizer-s-inheritance-set-expansion-code-6.patch text/x-diff 16.2 KB
0006-Teach-a-few-places-to-use-partition-check-quals-6.patch text/x-diff 30.2 KB
0007-Introduce-a-PartitionTreeNode-data-structure-6.patch text/x-diff 8.0 KB
0008-Tuple-routing-for-partitioned-tables-6.patch text/x-diff 41.0 KB
0009-Update-DDL-Partitioning-chapter-to-reflect-new-devel-6.patch text/x-diff 24.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2016-09-29 12:12:30 Re: Push down more full joins in postgres_fdw
Previous Message Ashutosh Bapat 2016-09-29 11:06:39 Re: postgres_fdw : altering foreign table not invalidating prepare statement execution plan.