Re: Declarative partitioning - another take

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, Rajkumar Raghuwanshi <rajkumar(dot)raghuwanshi(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Declarative partitioning - another take
Date: 2016-11-04 00:16:57
Message-ID: CA+TgmoY=tGtVh-nLiaZGT7cTNGHdWbpe+2JRsDLmG_sGbC4QZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Apologies if I've made some of these comments before and/or missed
comments you've made on these topics. The size of this patch set is
so large that it's hard to keep track of everything.

Re-reviewing 0001:

+ indicate which table columns are used as partition key. For example,

s/are used as/are part of the/

+ third table columns make up the partition key. A zero in this array
+ indicates that the corresponding partition key column is an expression
+ over the table columns, rather than a simple column reference.

I think you can leave out "over the table columns".

+ columns or expressions forms the <firstterm>partitioning key</firstterm>

s/forms/form/

+ The table itself is empty. A data row inserted into the table is routed

s/The table/The partitioned table/

+ * Anything mentioned in the expressions. We must ignore the column
+ * references which will count as self-dependency items; in this case,
+ * the depender is the table itself (there is no such thing as partition
+ * key object).

"depender" is not really a word, and the parenthetical note isn't very
clear. Maybe: We must ignore the column references, which will depend
on the table itself; there is no separate partition key object.

+ heap_close(pg_partitioned_table, RowExclusiveLock);

It seems like it would be safe to do this right after
CatalogUpdateIndexes(pg_partitioned_table, tuple), and I'd recommend
that you do. Not for performance or anything, but just to keep
related code together.

/*
* Resolve possibly-defaulted operator class specification
*/
-static Oid
+Oid
GetIndexOpClass(List *opclass, Oid attrType,

Perhaps we should rename this function to ResolveOpClass, since it's
now going to be used for both indexes and partitioning keys.

+ * Sets *is_expr if attnum is found to be referenced in some partition key
+ * expression.

is_expr doesn't seem quite as clear as, say, used_by_expr or used_in_expr.

Also, the specification for this function doesn't seem to be very
clear about what this is supposed to do if the same column is both an
explicit partitioning column and also used in an expression, and the
code looks like it'll return with *is_expr set based on whichever use
it encounters first. If that's intended behavior, maybe add a comment
like: It's possible for a column to be used both directly and as part
of a partition key expression; if that happens, *is_expr may end up as
either true or false. That's OK for current uses of this function,
because *is_expr is only used to tailor the error message text.

+ if (is_expr)
+ *is_expr = false;
+ if (attnum == partattno)
+ return true;

I think you should adjust this (and the other bit in the same
function) so that you don't set *is_expr until you're committed to
returning.

+ index = -1;
+ while ((index = bms_next_member(expr_attrs, index)) > 0)
+ {
+ AttrNumber attno = index + FirstLowInvalidHeapAttributeNumber;
+
+ if (attno == attnum)
+ return true;
+ }

How about bms_is_member(expr_attrs, attnum -
FirstLowInvalidHeapAttributeNumber), instead of looping?

+ errmsg("cannot reference relation \"%s\"",
RelationGetRelationName(pkrel)),
+ errdetail("Referencing partitioned tables in foreign
key constraints is not supported.")));

I think you could get rid of the errdetail and just have the error
message be "cannot reference partitioned table \"%s\"".

+ errmsg("column \"%s\" appears twice in
partition key", pelem->name),

It could be there three times! How about column \"%s\" appears more
than once in partition key? (I see that you seem to have adapted this
from some code in parse_utilcmd.c, which perhaps should also be
adjusted, but that's a job for another patch.)

+ /*
+ * Strip any top-level COLLATE clause. This ensures that we treat
+ * "x COLLATE y" and "(x COLLATE y)" alike.
+ */

But you don't, right? Unless I am confused, which is possible, the
latter COLLATE will be ignored, while the former one will set the
collation to be used in the context of partitioning comparisons.

Re-reviewing 0002:

+ if (fout->remoteVersion >= 100000)
+ {
+ PQExpBuffer acl_subquery = createPQExpBuffer();
+ PQExpBuffer racl_subquery = createPQExpBuffer();
+ PQExpBuffer initacl_subquery = createPQExpBuffer();
+ PQExpBuffer initracl_subquery = createPQExpBuffer();
+
+ PQExpBuffer attacl_subquery = createPQExpBuffer();
+ PQExpBuffer attracl_subquery = createPQExpBuffer();
+ PQExpBuffer attinitacl_subquery = createPQExpBuffer();
+ PQExpBuffer attinitracl_subquery = createPQExpBuffer();

It seems unnecessary to repeat all of this. The only differences
between the 10000 version and the 9600 version are:

60,61c60
< "AS changed_acl, "
< "CASE WHEN c.relkind = 'P' THEN
pg_catalog.pg_get_partkeydef(c.oid) ELSE NULL END AS partkeydef "
---
> "AS changed_acl "
73c72
< "WHERE c.relkind in ('%c', '%c', '%c', '%c',
'%c', '%c', '%c') "
---
> "WHERE c.relkind in ('%c', '%c', '%c', '%c', '%c', '%c') "
87,88c86
< RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE,
< RELKIND_PARTITIONED_TABLE);
---
> RELKIND_MATVIEW, RELKIND_FOREIGN_TABLE);

But none of that is really a problem. Sure, the 'P' case will never
arise in 9.6, but so what? I'd really like to not keep duplicating
these increasingly-complex hunks of code if we can find some way to
avoid that.

/* find all the inheritance information */

- appendPQExpBufferStr(query, "SELECT inhrelid, inhparent FROM
pg_inherits");
+ appendPQExpBufferStr(query,
+ "SELECT inhrelid, inhparent "
+ "FROM pg_inherits "
+ "WHERE inhparent NOT
IN (SELECT oid FROM pg_class WHERE relkind = 'P')");

I think you need to update the comment. "Find inheritance
information, excluding implicit inheritance via partitioning. We're
not interested in that case because $REASON."

Re-reviewing 0003:

+ <para>
+ If this table is a partition, one cannot perform <literal>DROP
NOT NULL</>
+ on a column if it is marked not null in the parent table.
+ </para>

This would, presumably, also be true for inheritance. I think we
could just leave this out.

+ as partition of the target table. The partition bound specification must

s/as partition/as a partition/

+ correspond to the partitioning method and partitioning key of the target

I think that in most places were are referring to the "partitioning
method" (with ing) but the "partition key" (without ing). Let's try to
be consistent.

+ table. The table to be attached must have all the columns as the target
+ table and no more; moreover, the column types must also match. Also, it
+ must have all the matching constraints as the target table.

s/all the columns/all of the same columns/

The second sentence doesn't seem quite grammatical. And why would
that be true anyway? Partitions can have extra constraints, and if
they lack constraints that are present on the partitioned table, those
constraints will be added and validated, right?

+ A full table scan is performed on the table being attached to check that
+ existing rows in the table are within the specified partition bound.
+ If it is known in advance that no partition bound violating rows are
+ present in the table, the above scan can be skipped by specifying the
+ <literal>NO VALIDATE</> option.
+ Default behavior is to perform the scan, as if the
<literal>VALIDATE</literal>
+ option were specified.

I don't think it's OK to allow the partition to be added if it
contains rows that might not be valid. We are generally vary wary
about adding options that can cause data integrity failures and I
think we shouldn't start here, either. On the other hand, it's also
not desirable for adding a partition to take O(n) time in all cases.
So what would be nice is if the new partition could self-certify that
contains no problematic rows by having a constraint that matches the
new partitioning constraint. Then we can skip the scan if that
constraint is already present.

+ inherited columns. One can also specify table constraints, in addition

Delete comma.

+ to those inherited from the parent. If a check constraint with the name
+ matching one of the parent's constraint is specified, it is merged with
+ the latter, provided the specified condition is same.

Doesn't that effectively delete the merged constraint?

Suggest: "If the parent already has a check constraint with the same
name as a constraint specified for the child, the conditions must be
the same."

+) FOR VALUES IN ('los angeles', 'san fransisco');

That's not you you spell San Francisco.

+ Create partition of a list partitioned table that itself is further
+ partitioned and then create its partition:

s/itself is/is itself/
s/then create its partition/then add a partition to it/

+ if (!is_local || con->coninhcount == 0)
+ con->coninhcount++;

I would think that you could get rid of the "if" and just say
con->coninhcount = 1. It seems to me (and perhaps the comment could
say this) that for a partitioned table, we can simplify the handling
of coninhcount and conislocal. Since partitioning hierarchies don't
allow multiple inheritance, any inherited constraint must be inherited
exactly once. Since a partition cannot have inheritance children --
except by being partitioned itself -- there is no point in tracking
conislocal, so we always store it as false.

+
+void
+StorePartitionBound(Relation rel, Node *bound)

Header comment, please!

+ (void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
+ &isnull);
+ Assert(isnull);

We try not to do unnecessary work in non-assert-enabled builds solely
for the benefit of assert-enabled builds. We also try not to end up
with variables that are only used in assert-enabled builds but not
marked PG_USED_FOR_ASSERTS_ONLY, because those tend to cause compiler
warnings. I'm not sure an compiler would be smart enough to warn
about this, but I suggest adding an #ifdef USE_ASSERT_CHECKING with a
block inside where the offending variable is declared. Actually, I
think you need to move a bit more code. Hmm. Something like:

#ifdef USE_ASSERT_CHECKING
{
Form_pg_class classForm;
bool isnull;

classForm = (Form_pg_class) GETSTRUCT(tuple);
Assert(!classForm->relispartition);

(void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
&isnull);
Assert(isnull);
}
#endif

+ * are same in common cases, of which we only store one.

"and we only store one of them."

+ /*
+ * Never put a
null into the values array, flag
+ * instead for
the code further down below where
+ * we
construct the actual relcache struct.
+ */
+
found_null_partition = true;
+
null_partition_index = i;

How about we elog(ERROR, ...) if found_null_partition is already set?

+ foreach(cell, non_null_values)
+ {
+ PartitionListValue *src =
lfirst(cell);
+
+ all_values[i] = (PartitionListValue *)
+
palloc(sizeof(PartitionListValue));
+ all_values[i]->value =
datumCopy(src->value,
+
key->parttypbyval[0],
+
key->parttyplen[0]);
+ all_values[i]->index = src->index;
+ i++;
+ }

Why do we need to datumCopy() here when we just did that in the
previous loop? Why did the previous loop need to datumCopy(), anyway?
Can't we just pass the same datums around? I understand that you
need to copy the data into the correct context at the end, but doing
two copies prior to that seems excessive.

+get_leaf_partition_oids(Oid relid, int lockmode)

Instead of implementing this, can you just use find_all_inheritors()?

+ /*
+ * Is lower_val = upper_val?
+ */
+ if (lower_val && upper_val)

So, that comment does not actually match that code. That if-test is
just checking whether both bounds are finite. What I think you should
be explaining here is that if lower_val and upper_val are both
non-infinite, and if the happen to be equal, we want to emit an
equality test rather than a >= test plus a <= test because $REASON.

+ operoid = get_opfamily_member(key->partopfamily[col],
+
key->parttypid[col],
+
key->parttypid[col],
+ strategy);
+ if (!OidIsValid(operoid))
+ {
+ operoid = get_opfamily_member(key->partopfamily[col],
+
key->partopcintype[col],
+
key->partopcintype[col],
+
strategy);
+ *need_relabel = true;
+ }

There's a comment explaining THAT you do this ("Use either the column
type as the operator datatype or opclass's declared input type.") but,
to repeat a complaint that I've made often, nothing explaining why.
In this particular case, what's not clear - in my opinion - is why you
need to try two different possibilities, and why at least one of those
possibilities is guaranteed to work. I gather that if the opfamily
doesn't contain an operator for the actual type of the partitioning
column, you think it will certainly contain one for the input type of
the operator class (which seems right), and that the input type of the
operator class will be binary-compatible with the type of the
partitioning column (which is less-obviously true, and needs some
explanation).

I also think that this function should elog(ERROR, ...) if by any
chance the second get_opfamily_member() call also fails. Otherwise the
error might occur quite a bit downstream and be hard to understand.

+ ReleaseSysCache(tuple);
+ heap_close(parent, AccessShareLock);

I think you had better not release the lock here - i.e. pass NoLock.
We don't release locks on tables until transaction commit, except for
catalog tables. This also comes up in, at least,
transformPartitionOf().

+ PartitionRangeBound *b1 = (*(PartitionRangeBound *const *) a);
+ PartitionRangeBound *b2 = (*(PartitionRangeBound *const *) b);
+ PartitionKey key = (PartitionKey) arg;
+
+ return partition_rbound_cmp(key, b1, b2);

Whitespace.

+ * as partition and schema consists of columns definitions corresponding

the schema consists

- if (recurse)
+ /* Force inheritance recursion, if partitioned table. */
+ if (recurse || rel->rd_rel->relkind ==
RELKIND_PARTITIONED_TABLE)

I would instead error out if ONLY is specified for a partitioned table.

+ /*
+ * If the table is source table of ATTACH PARTITION
command, following
+ * check is unnecessary.
+ */

As usual, comment should say WHY.

+ if (partqualstate && !ExecQual(partqualstate,
econtext, true))
+ ereport(ERROR,
+
(errcode(ERRCODE_CHECK_VIOLATION),
+ errmsg("child table
contains a row violating partition bound specification")));

Why not mimic the existing phrasing? "partition constraint is
violated by some row"

What happens if you try to attach a table as a partition of itself or
one of its ancestors?

- errmsg("column \"%s\" in child table
must be marked NOT NULL",
- attributeName)));
+ errmsg("column \"%s\"
in child table must be marked NOT NULL",
+
attributeName)));

Whitespace-only hunk; revert. You cannot fight the power of pgindent.

+ errmsg("cannot attach table that is a
inheritance child as partition")));

an inheritance child

+ errmsg("cannot attach a temporary relation of another
session as partition ")));

Extra space.

+ errdetail("Table being
attached should contain only the columns"
+ " present
in parent.")));

Suggest: "New partition should contain only..."

Also, don't break error messages into multiple strings. Make it one
long string and let pgindent deal.

+ | FOR VALUES START '(' range_datum_list ')' lb_inc
+ END_P '('
range_datum_list ')' ub_inc

Just a random idea. Would for VALUES FROM ( ... ) TO ( ... ) be more
idiomatic than START and END?

+static void
+transformPartitionOf(CreateStmtContext *cxt, Node *bound)
+{
+ TupleDesc tupdesc;
+ int i;
+ RangeVar *part = cxt->relation;
+ RangeVar *partof = linitial(cxt->inhRelations);
+ Relation parentRel;
+
+ parentRel = heap_openrv(partof, AccessShareLock);
+

If there's anyway that we might do heap_openrv() on the same RangeVar
at multiple places in the code, it presents security and integrity
hazards because the referent of that RangeVar might change in the
meantime. I suspect there is such a risk here - won't we need to open
the relation again when we actually want to do the operation?

Also, we should never acquire a lower-level lock on a relation and
then, further downstream, acquire a higher-level lock on the same
object. To do so creates a deadlock hazard. That seems like it might
be a problem here, too.

+ /*if (ldatum->infinite && rdatum->infinite)
+ ereport(ERROR,
+
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("both START
and END datum for column \"%s\" cannot be UNBOUNDED",
+ colname),
+
parser_errposition(cxt->pstate, rdatum->location)));*/

Commented-out code is bad.

+ if (list_length(spec->lowerdatums) > partnatts)
+ ereport(ERROR,
+
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("START has more values
than the length of the partition key"),
+
parser_errposition(cxt->pstate,
+
exprLocation(list_nth(spec->lowerdatums,
+
list_length(spec->lowerdatums) - 1)))));
+ else if (list_length(spec->lowerdatums) < partnatts)
+ ereport(ERROR,
+
(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+ errmsg("START has fewer
values than the length of the partition key"),
+
parser_errposition(cxt->pstate,
+
exprLocation(list_nth(spec->lowerdatums,
+
list_length(spec->lowerdatums) - 1)))));

It would be worth looking for a way to unify these cases. Like "START
must specify exactly one value per partitioning column".

+ * Same oids? No mind order - in the list case, it
matches the order
+ * in which partition oids are returned by a
pg_inherits scan, whereas
+ * in the range case, they are in order of ranges of individual
+ * partitions. XXX - is the former unsafe?

Probably. It likely depends on the physical ordering of tuples in the
table, which can change.

+ * BoundCollection encapsulates a set of partition bounds of either physical
+ * or logical relations. It is associated with a partitioned relation of
+ * which the aforementioned relations are partitions.

"physical or logical relations" is unfamiliar terminology.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Haribabu Kommi 2016-11-04 01:44:45 Re: [BUGS] BUG #14350: VIEW with INSTEAD OF INSERT TRIGGER and COPY. Missing feature or working as designed.
Previous Message Craig Ringer 2016-11-04 00:14:53 Re: C based plugins, clocks, locks, and configuration variables