Re: Declarative partitioning - another take

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-08-15 11:21:49
Message-ID: 7f2b42ec-ce8b-7171-f17f-4e2fca208bac@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Thanks a lot for taking a look at this.

On 2016/08/11 3:22, Robert Haas wrote:
> On Wed, Aug 10, 2016 at 7:09 AM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Attached is the latest set of patches to implement declarative
>> partitioning.
>
> Cool. I would encourage you to give some thought to what is the least
> committable subset of these patches, and think about whether it can be
> reorganized to make that smaller. Based on the descriptions, it
> sounds to me like the least committable subset is probably all of 0001
> - 0005 as one giant commit, and that's a lot of code. Maybe there's
> no real way to come up with something more compact than that, but it's
> worth some thought.

I will consider this and try to come up with a minimal patch set covering
what is now 0001 - 0005.

>> 0001-Catalog-and-DDL-for-partition-key.patch
>
> + <entry><link
> linkend="catalog-pg-partitioned"><structname>pg_partitioned</structname></link></entry>
>
> pg_partitioned seems like a slightly strange choice of name. We have
> no other catalog tables whose names end in "ed", so the use of a past
> participle here is novel. More generally, this patch seems like it's
> suffering a bit of terminological confusion: while the catalog table
> is called pg_partitioned, the relkind is RELKIND_PARTITION_REL. And
> get_relation_by_qualified_name() thinks that RELKIND_PARTITION_REL is
> a "table", but getRelationDescription thinks it's a "partitioned
> table". I think we need to get all these bits and pieces on the same
> page, or have some kind of consistent way of deciding what to do in
> each case. Maybe pg_partitioned_table, RELKIND_PARTITIONED_TABLE,
> etc. for the internal stuff, but just "table" in user-facing messages.

Name pg_partitioned does sound a little strange now that you mention it.

I agree to make the terminology more consistent and to that end, your
suggestion to keep the user-facing term "table" and using
pg_partitioned_table and RELKIND_PARTITIONED_TABLE for internals sounds good.

> Alternatively, I wonder if we should switch to calling this a
> "partition root" rather than a "partitioned table". It's not the
> whole table, just the root. And doesn't contain data - in fact it
> probably shouldn't even have storage - so calling it a "table" might
> be confusing. But if we're using CREATE TABLE to create it in the
> ifrst place, then calling it something other than a table later on is
> also confusing. Hmm.

I think it makes sense to keep calling it a table because it has all the
logical properties of a table even though it will differ from a regular
table on the basis of physical implementation details such as that it does
not own physical storage. Am I missing something?

>
> + <entry><structfield>partexprs</structfield></entry>
>
> There's a certain symmetry between this and what we do for indexes,
> but I'm wondering whether there's a use case for partitioning a table
> by an expression rather than a column value. I suppose if you've
> already done the work, there's no harm in supporting it.

Yeah, it's not a whole lot of code to manage expressions alongside simple
column references.

> +[ PARTITION BY {RANGE | LIST} ( { <replaceable
> class="parameter">column_name</replaceable> | ( <replaceable
> class="parameter">expression</replaceable> ) } [ <replaceable
> class="parameter">opclass</replaceable> ] [, ...] )
>
> The spacing isn't right here. Should say { RANGE | LIST } rather than
> {RANGE|LIST}. Similarly elsewhere.

Will fix.

> + thus created is called <firstterm>partitioned</firstterm> table. Key
> + consists of an ordered list of column names and/or expressions when
> + using the <literal>RANGE</> method, whereas only a single column or
> + expression can be specified when using the <literal>LIST</> method.
>
> Why do we support range partitioning with multiple columns, but list
> partitioning only with a single column?

This is mostly because I didn't find any other database supporting it and
hence thought maybe there is not much use for it.

On the implementation side, I think it makes sense to think of ordering of
tuples (for a multi-column range key), where we compare a new tuple's
partition key columns one-by-one until we find the column such that its
value != the value of corresponding column in the range upper bound of a
partition (usually but not necessarily the last column of the key). Once
we have determined which side of the bound the value was, we then perform
the same process with either the lower bound of the same partition or
upper bound of some partition in the other half as determined by binary
search logic.

Conversely, when determining the constraint on rows a range partition
contains, we emit a (keycol = lowerval) expression for all columns i =
0..j-1 where range.lowerval[i] = range.upperval[i] and then emit (keycol >
/ >= range.lowerval[j] and keycol < / <= upperval[j]) for the first column
j where lowerval[j] < upperval[j] and stop at that column (any further
columns are irrelevant).

When considering the same for list partitioning where we consider set-
membership of values (based on equality of a new tuple's value for the
column and a partition's list of values), implementing multi-column logic
does not seem as straightforward. Also, it might make the optimizations
we recently discussed slightly more complicated to implement. Although I
may be missing something.

> + The type of a key column or an expresion must have an associated
>
> Typo.

Will fix.

>
> + <para>
> + Currently, there are following limitations on definition of partitioned
> + tables: one cannot specify any UNIQUE, PRIMARY KEY, EXCLUDE and/or
> + FOREIGN KEY constraints.
> + </para>
>
> But I presume you can define these constraints on the partitions;
> that's probably worth mentioning. I'd say something like this
> "Partitioned tables do not support UNIQUE, PRIMARY, EXCLUDE, or
> FOREIGN KEY constraints; however, you can define these constraints on
> individual data partitions."

Okay, will mention that.

> + if (partexprs)
> + recordDependencyOnSingleRelExpr(&myself,
> + (Node *) partexprs,
> + RelationGetRelid(rel),
> + DEPENDENCY_NORMAL,
> + DEPENDENCY_IGNORE);
>
> I don't think introducing a new DEPENDENCY_IGNORE type is a good idea
> here. Instead, you could just add an additional Boolean argument to
> recordDependencyOnSingleRelExpr. That seems less likely to create
> bugs in unrelated portions of the code.

I did consider a Boolean argument instead of a new DependencyType value,
however it felt a bit strange to pass a valid value for the fifth argument
(self_behavior) and then ask using a separate parameter that it (a
self-dependency) is to be ignored. By the way, no pg_depend entry is
created on such a call, so the effect of the new type's usage seems
localized to me. Thoughts?

>
> + /*
> + * Override specified inheritance option, if relation is a partitioned
> + * table and it's a target of INSERT.
> + */
> + if (alsoSource)
> + rte->inh |= relid_is_partitioned(rte->relid);
>
> This seems extremely unlikely to be the right way to do this. We're
> just doing parse analysis here, so depending on properties of the
> table that are (or might be made to be) changeable is not good. It's
> also an extra catalog lookup whether the table is partitioned or not
> (and whether rte->inh is already true or not). Furthermore, it seems
> to go against the comment explaining what rte->inh is supposed to
> mean, which says:
>
> * inh is TRUE for relation references that should be expanded to include
> * inheritance children, if the rel has any. This *must* be FALSE for
> * RTEs other than RTE_RELATION entries.
>
> I am not sure what problem you're trying to solve here, but I suspect
> this needs to be ripped out altogether or handled later, during
> planning. Similarly for the related change in addRangeTableEntry.

Okay, I will see how this could rather be done within the planner. My
intention here is to keep the code churn low by not modifying a lot of
places in the code where rte->inh is checked to turn inheritance on or
off. I agree though that it's a bit ugly and perhaps wrong.

> + {PartitionedRelationId, /* PARTEDRELID */
> + PartitionedRelidIndexId,
> + 1,
> + {
> + Anum_pg_partitioned_partedrelid,
> + 0,
> + 0,
> + 0
> + },
> + 128
>
> I'd probably cut the initial size of this down a bit. Odds are good
> that not all of the tables you access will be partitioned. Of course
> that assumes we'll avoid probing this syscache for non-partitioned
> tables, but I think that's pretty important.

Okay, I will turn that down to 64 maybe.

>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> + errmsg("cannot alter column named in partition key")));
> + else
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TABLE_DEFINITION),
> + errmsg("cannot alter column referenced in
> partition key expression")));
>
> Maybe "cannot alter type of column named in partition key"? And
> similarly for the other one.

Will fix.

>
> +/*
> + * transformPartitionBy
> + * Transform any expressions present in the partition key
> + */
> +static PartitionBy *
> +transformPartitionBy(Relation rel, PartitionBy *partitionby)
>
> Shouldn't this be in src/backend/parser/parse_*.c instead of tablecmds.c?

It can be performed only after the new relation is opened which can be
only after we have called heap_create_with_catalog() which is in
tablecmds.c: DefineRelation(). I think that's because we call
transformExpr() on partition expressions which expects to be able to see
the table's columns. Other callers such as transformIndexStmt and
transformAlterTableStmt can do transformations within parse_utilcmds.c
because they can open the relation there.

>
> Most of this (0001) looks pretty reasonable. I'm sure that there is
> more that needs fixing than what I've mentioned above, which is just
> what I saw on an initial read-through, but overall I like the
> approach.
>

Okay, thanks. I will post the updated patches soon.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas 'ads' Scherbaum 2016-08-15 11:33:43 Re: to_date_valid()
Previous Message Haribabu Kommi 2016-08-15 11:17:21 pg_hba_file_settings view patch