Re: UPDATE of partition key

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-02-23 10:32:55
Message-ID: de76055a-b9bb-a7bd-dbf1-fdf6ddfe3ccb@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Amit,

Thanks for working on this.

On 2017/02/13 21:01, Amit Khandekar wrote:
> Currently, an update of a partition key of a partition is not allowed,
> since it requires to move the row(s) into the applicable partition.
>
> Attached is a WIP patch (update-partition-key.patch) that removes this
> restriction. When an UPDATE causes the row of a partition to violate
> its partition constraint, then a partition is searched in that subtree
> that can accommodate this row, and if found, the row is deleted from
> the old partition and inserted in the new partition. If not found, an
> error is reported.

That's clearly an improvement over what we have now.

> There are a few things that can be discussed about :
>
> 1. We can run an UPDATE using a child partition at any level in a
> nested partition tree. In such case, we should move the row only
> within that child subtree.
>
> For e.g. , in a tree such as :
> tab ->
> t1 ->
> t1_1
> t1_2
> t2 ->
> t2_1
> t2_2
>
> For "UPDATE t2 set col1 = 'AAA' " , if the modified tuple does not fit
> in t2_1 but can fit in t1_1, it should not be moved to t1_1, because
> the UPDATE is fired using t2.

Makes sense. One should perform the update by specifying tab such that
the row moves from t2 to t1, before we could determine t1_1 as the target
for the new row. Specifying t2 directly in that case is clearly the
"violates partition constraint" situation. I wonder if that's enough a
hint for the user to try updating the parent (or better still, root
parent). Or as we were discussing, should there be an actual HINT message
spelling that out for the user.

> 2. In the patch, as part of the row movement, ExecDelete() is called
> followed by ExecInsert(). This is done that way, because we want to
> have the ROW triggers on that (sub)partition executed. If a user has
> explicitly created DELETE and INSERT BR triggers for this partition, I
> think we should run those. While at the same time, another question
> is, what about UPDATE trigger on the same table ? Here again, one can
> argue that because this UPDATE has been transformed into a
> DELETE-INSERT, we should not run UPDATE trigger for row-movement. But
> there can be a counter-argument. For e.g. if a user needs to make sure
> about logging updates of particular columns of a row, he will expect
> the logging to happen even when that row was transparently moved. In
> the patch, I have retained the firing of UPDATE BR trigger.

What of UPDATE AR triggers?

As a comment on how row-movement is being handled in code, I wonder if it
could be be made to look similar structurally to the code in ExecInsert()
that handles ON CONFLICT DO UPDATE. That is,

if (partition constraint fails)
{
/* row movement */
}
else
{
/* ExecConstraints() */
/* heap_update(), EvalPlanQual(), and ExecInsertIndexTuples() */
}

I see that ExecConstraint() won't get called on the source partition's
constraints if row movement occurs. Maybe, that's unnecessary because the
new row won't be inserted into that partition anyway.

ExecWithCheckOptions() for RLS update check does happen *before* row
movement though.

> 3. In case of a concurrent update/delete, suppose session A has locked
> the row for deleting it. Now a session B has decided to update this
> row and that is going to cause row movement, which means it will
> delete it first. But when session A is finished deleting it, session B
> finds that it is already deleted. In such case, it should not go ahead
> with inserting a new row as part of the row movement. For that, I have
> added a new parameter 'already_delete' for ExecDelete().

Makes sense. Maybe: already_deleted -> concurrently_deleted.

> Of course, this still won't completely solve the concurrency anomaly.
> In the above case, the UPDATE of Session B gets lost. May be, for a
> user that does not tolerate this, we can have a table-level option
> that disallows row movement, or will cause an error to be thrown for
> one of the concurrent session.

Will this table-level option be specified for a partitioned table once or
for individual partitions?

> 4. The ExecSetupPartitionTupleRouting() is re-used for routing the row
> that is to be moved. So in ExecInitModifyTable(), we call
> ExecSetupPartitionTupleRouting() even for UPDATE. We can also do this
> only during execution time for the very first time we find that we
> need to do a row movement. I will think over that, but I am thinking
> it might complicate things, as compared to always doing the setup for
> UPDATE. WIll check on that.

Hmm. ExecSetupPartitionTupleRouting(), which does significant amount of
setup work, is fine being called in ExecInitModifyTable() in the insert
case because there are often cases where that's a bulk-insert and hence
cost of the setup work is amortized. Updates, OTOH, are seldom done in a
bulk manner. So that might be an argument for doing it late only when
needed. But that starts to sound less attractive when one realizes that
that will occur for every row that wants to move.

I wonder if updates that will require row movement when done will be done
in a bulk manner (as a maintenance op), so one-time tuple routing setup
seems fine. Again, enable_row_movement option specified for the parent
sounds like it would be a nice to have. Only do the setup if it's turned
on, which goes without saying.

> 5. Regarding performance testing, I have compared the results of
> row-movement with partition versus row-movement with inheritance tree
> using triggers. Below are the details :
>
> Schema :

[ ... ]

> parts partitioned inheritance no. of rows subpartitions
> ===== =========== =========== =========== =============
>
> 500 10 sec 3 min 02 sec 1,000,000 0
> 1000 10 sec 3 min 05 sec 1,000,000 0
> 1000 1 min 38sec 30min 50 sec 10,000,000 0
> 4000 28 sec 5 min 41 sec 1,000,000 10
>
> part : total number of partitions including subparitions if any.
> partitioned : Partitions created using declarative syntax.
> inheritence : Partitions created using inheritence , check constraints
> and insert,update triggers.
> subpartitions : Number of subpartitions for each partition (in a 2-level tree)
>
> Overall the UPDATE in partitions is faster by 10-20 times compared
> with inheritance with triggers.
>
> The UPDATE query moved all of the rows into another partition. It was
> something like this :
> update ptab set a = '1949-01-1' where a <= '1924-01-01'
>
> For a plain table with 1,000,000 rows, the UPDATE took 8 seconds, and
> with 10,000,000 rows, it took 1min 32sec.

Nice!

> In general, for both partitioned and inheritence tables, the time
> taken linearly rose with the number of rows.

Hopefully not also with the number of partitions though.

I will look more closely at the code soon.

Thanks,
Amit

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pantelis Theodosiou 2017-02-23 10:38:20 Re: Make subquery alias optional in FROM clause
Previous Message Jan Michálek 2017-02-23 10:29:11 Other formats in pset like markdown, rst, mediawiki