Re: UPDATE of partition key

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-11-13 12:55:45
Message-ID: CAKJS1f_9ggWb1KiGrUCHQWSiiSK89zO+iEGx+VEgyiWfxpmX1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10 November 2017 at 16:42, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
[ update-partition-key_v23.patch ]

Hi Amit,

Thanks for working on this. I'm looking forward to seeing this go in.

So... I've signed myself up to review the patch, and I've just had a
look at it, (after first reading this entire email thread!).

Overall the patch looks like it's in quite a good shape. I think I do
agree with Robert about the UPDATE anomaly that's been discussed. I
don't think we're painting ourselves into any corner by not having
this working correctly right away. Anyone who's using some trigger
workarounds for the current lack of support for updating the partition
key is already going to have the same issues, so at least this will
save them some troubles implementing triggers and give them much
better performance. I see you've documented this fact too, which is
good.

I'm writing this email now as I've just run out of review time for today.

Here's what I noted down during my first pass:

1. Closing command tags in docs should not be abbreviated

triggers are concerned, <literal>AFTER</> <command>DELETE</command> and

This changed in c29c5789. I think Peter will be happy if you don't
abbreviate the closing tags.

2. "about to do" would read better as "about to perform"

concurrent session, and it is about to do an <command>UPDATE</command>

I think this paragraph could be more clear if we identified the
sessions with a number.

Perhaps:
Suppose, session 1 is performing an <command>UPDATE</command> on a
partition key, meanwhile, session 2 tries to perform an <command>UPDATE
</command> or <command>DELETE</command> operation on the same row.
Session 2 can silently miss the row due to session 1's activity. In
such a case, session 2's <command>UPDATE</command>/<command>DELETE
</command>, being unaware of the row's movement, interprets this that the
row has just been deleted, so there is nothing to be done for this row.
Whereas, in the usual case where the table is not partitioned, or where
there is no row movement, the second session would have identified the
newly updated row and carried <command>UPDATE</command>/<command>DELETE
</command> on this new row version.

3. Integer width. get_partition_natts returns int but we assign to int16.

int16 partnatts = get_partition_natts(key);

Confusingly get_partition_col_attnum() returns int16 instead of AttrNumber
but that's existingly not correct.

4. The following code could just pull_varattnos(partexprs, 1, &child_keycols);

foreach(lc, partexprs)
{
Node *expr = (Node *) lfirst(lc);

pull_varattnos(expr, 1, &child_keycols);
}

5. Triggers. Do we need a new "TG_" tag to allow trigger functions to
do something
special when the DELETE/INSERT is a partition move? I have audit
tables in mind here
it may appear as though a user performed a DELETE when they actually
performed an UPDATE
giving visibility of this to the trigger function will allow the
application to work around this.

6. change "row" to "a row" and "old" to "the old"

* depending on whether the event is for row being deleted from old

But to be honest, I'm having trouble parsing the comment. I think it
would be better to
say explicitly when the row will be NULL rather than "depending on
whether the event"

7. I'm confused with how this change came about. If the old comment
was correct here then the comment you're referring to here should
remain in ExecPartitionCheck(), but you're saying it's in
ExecConstraints().

/* See the comments in ExecConstraints. */

If the comment really is in ExecConstraints(), then you might want to
give an overview of what you mean, then reference ExecConstraints() if
more details are required.

8. I'm having trouble parsing this comment:

* 'update_rri' has the UPDATE per-subplan result rels.

I think "has" should be "contains" ?

9. Also, this should likely be reworded:

* 'num_update_rri' : number of UPDATE per-subplan result rels. For INSERT,
* this is 0.

'num_update_rri' number of elements in 'update_rri' array or zero for INSERT.

10. There should be no space before the '?'

/* Is this leaf partition present in the update resultrel ? */

11. I'm struggling to understand this comment:

* This is required when converting tuple as per root
* partition tuple descriptor.

"tuple" should probably be "the tuple", but not quite sure what you
mean by "as per root".

I may have misunderstood, but maybe it should read:

* This is required when we convert the partition's tuple to
* be compatible with the partitioned table's tuple descriptor.

12. I think "as well" would be better written as "either".

* If we didn't open the partition rel, it means we haven't
* initialized the result rel as well.

13. I'm unsure what is meant by the following comment:

* Verify result relation is a valid target for insert operation. Even
* for updates, we are doing this for tuple-routing, so again, we need
* to check the validity for insert operation.

I'm not quite sure where UPDATE comes in here as we're only checking for INSERT?

14. Use of underscores instead of camelCase.

COPY_SCALAR_FIELD(part_cols_updated);

I know you're not the first one to break this as "partitioned_rels"
does not follow it either, but that's probably not a good enough
reason to break away from camelCase any further.

I'd suggest "partColsUpdated". But after a re-think, maybe cols is
incorrect. All columns are partitioned, it's the key columns that we
care about, so how about "partKeyUpdate"

15. Are you sure that you mean "root" here?

* All the child partition attribute numbers are converted to the root
* partitioned table.

Surely this is just the target relation. "parent" maybe? A
sub-partitioned table might be the target of an UPDATE too.

15. I see get_all_partition_cols() is just used once to check if
parent_rte->updatedCols contains and partition keys.

Would it not be better to reform that function and pass
parent_rte->updatedCols in and abort as soon as you see a single
match?

Maybe the function could return bool and be named
partitioned_key_overlaps(), that way your assignment in
inheritance_planner() would just become:

part_cols_updated = partitioned_key_overlaps(root->parse->rtable,
top_parentRTindex, partitioned_rels, parent_rte->updatedCols);

or something like that anyway.

16. Typo in comment

* 'part_cols_updated' if any partitioning columns are being updated, either
* from the named relation or a descendent partitione table.

"partitione" should be "partitioned". Also, normally for bool
parameters, we might word things like "True if ..." rather than just
"if"

You probably should follow camelCase I mentioned in 14 here too.

17. Comment needs a few changes:

* ConvertPartitionTupleSlot -- convenience function for converting tuple and
* storing it into a tuple slot provided through 'new_slot', which typically
* should be one of the dedicated partition tuple slot. Passes the partition
* tuple slot back into output param p_old_slot. If no mapping present, keeps
* p_old_slot unchanged.
*
* Returns the converted tuple.

There are a few typos here. For example, "tuple" should be "a tuple",
but maybe the comment should just be worded like:

* ConvertPartitionTupleSlot -- convenience function for tuple conversion
* using 'map'. The tuple, if converted, is stored in 'new_slot' and
* 'p_old_slot' is set to the original partition tuple slot. If map is NULL,
* then the original tuple is returned unmodified, otherwise the converted
* tuple is returned.

18. Line goes over 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

Better just to split the declaration and assignment.

19. Confusing comment:

/*
* If the original operation is UPDATE, the root partitioned table
* needs to be fetched from mtstate->rootResultRelInfo.
*/

It's not that clear here how you determine this is an UPDATE of a
partitioned key.

20. This code looks convoluted:

rootResultRelInfo = (mtstate->rootResultRelInfo ?
mtstate->rootResultRelInfo : resultRelInfo);

/*
* If the resultRelInfo is not the root partitioned table (which
* happens for UPDATE), we should convert the tuple into root's tuple
* descriptor, since ExecFindPartition() starts the search from root.
* The tuple conversion map list is in the order of
* mtstate->resultRelInfo[], so to retrieve the one for this resultRel,
* we need to know the position of the resultRel in
* mtstate->resultRelInfo[].
*/
if (rootResultRelInfo != resultRelInfo)
{

rootResultRelInfo is assigned via a ternary expression which makes the
subsequent if test seem a little strange.

Would it not be better to test:

if (mtstate->rootResultRelInfo)
{
rootResultRelInfo = mtstate->rootResultRelInfo
... other stuff ...
}
else
rootResultRelInfo = resultRelInfo;

Then above the if test you can explain that rootResultRelInfo is only
set during UPDATE of partition keys, as per #19.

21. How come you renamed mt_partition_tupconv_maps[] to
mt_parentchild_tupconv_maps[]?

22. Comment in ExecInsert() could be worded better.

/*
* In case this is part of update tuple routing, put this row into the
* transition NEW TABLE if we are capturing transition tables. We need to
* do this separately for DELETE and INSERT because they happen on
* different tables.
*/

/*
* This INSERT may be the result of a partition-key-UPDATE. If so,
* and we're required to capture transition tables then we'd better
* record this as a statement level UPDATE on the target relation.
* We're not interested in the statement level DELETE or INSERT as
* these occur on the individual partitions, none of which are the
* target of this the UPDATE statement.
*/

A similar comment could use a similar improvement in ExecDelete()

23. Line is longer than 80 chars.

TransitionCaptureState *transition_capture = mtstate->mt_transition_capture;

24. I know from reading the thread this name has changed before, but I
think delete_skipped seems like the wrong name for this variable in:

if (delete_skipped)
*delete_skipped = true;

Skipped is the wrong word here as that indicates like we had some sort
of choice and that we decided not to. However, that's not the case
when the tuple was concurrently deleted. Would it not be better to
call it "tuple_deleted" or even "success" and reverse the logic? It's
just a bit confusing that you're setting this to skipped before
anything happens. It would be nicer if there was a better way to do
this whole thing as it's a bit of a wart in the code. I understand why
the code exists though.

Also, I wonder if it's better to always pass a boolean here to save
having to test for NULL before setting it, that way you might consider
putting the success = false just before the return NULL, then do
success = true after the tuple is gone.
Failing that, putting: something like: success = false; /* not yet! */
where you're doing the if (deleted_skipped) test is might also be
better.

25. Comment "we should" should be "we must".

/*
* For some reason if DELETE didn't happen (for e.g. trigger
* prevented it, or it was already deleted by self, or it was
* concurrently deleted by another transaction), then we should
* skip INSERT as well, otherwise, there will be effectively one
* new row inserted.

Maybe just:
/* If the DELETE operation was unsuccessful, then we must not
* perform the INSERT into the new partition.

"for e.g." is not really correct in English. "For example, ..." or
just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
written "For exempli gratia", which translates to "For for example".

26. You're not really explaining what's going on here:

if (mtstate->mt_transition_capture)
saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

You have a comment later to say you're about to "Revert back to the
transition capture map", but I missed the part that explained about
modifying it in the first place.

27. Comment does not explain how we're skipping checking the partition
constraint check in:

* We have already checked partition constraints above, so skip
* checking them here.

Maybe something like:

* We've already checked the partition constraint above, however, we
* must still ensure the tuple passes all other constraints, so we'll
* call ExecConstraints() and have it validate all remaining checks.

28. For table WITH OIDs, the OID should probably follow the new tuple
for partition-key-UPDATEs.

CREATE TABLE p (a BOOL NOT NULL, b INT NOT NULL) PARTITION BY LIST (a)
WITH OIDS;
CREATE TABLE P_true PARTITION OF p FOR VALUES IN('t');
CREATE TABLE P_false PARTITION OF p FOR VALUES IN('f');
INSERT INTO p VALUES('t', 10);
SELECT tableoid::regclass,oid,a FROM p;
tableoid | oid | a
----------+-------+---
p_true | 16792 | t
(1 row)

UPDATE p SET a = 'f'; -- partition-key-UPDATE (oid has changed (it
probably shouldn't have))
SELECT tableoid::regclass,oid,a FROM p;
tableoid | oid | a
----------+-------+---
p_false | 16793 | f
(1 row)

UPDATE p SET b = 20; -- non-partition-key-UPDATE (oid remains the same)

SELECT tableoid::regclass,oid,a FROM p;
tableoid | oid | a
----------+-------+---
p_false | 16793 | f
(1 row)

I'll try to continue with the review tomorrow, but I think some other
reviews are also looming too.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2017-11-13 13:04:34 Migration to pglister - Before
Previous Message Pavel Stehule 2017-11-13 12:30:58 Re: proposal: schema variables