Re: UPDATE of partition key

From: Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UPDATE of partition key
Date: 2017-02-20 09:28:29
Message-ID: CAJ3gD9drFBmqWfsBTamYGkkyM4drps9JGRpeL2CCUtFy3Vf8bg@mail.gmail.com
Lists: pgsql-hackers

On 16 February 2017 at 20:53, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> On 13 February 2017 at 12:01, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com> wrote:
>>> There are a few things that can be discussed about this:
>>
>> If you do a normal update the new tuple is linked to the old one using
>> the ctid forming a chain of tuple versions. This tuple movement breaks
>> that chain. So the question I had reading this proposal is what
>> behaviour depends on ctid and how is it affected by the ctid chain
>> being broken.
>
> I think this is a good question.
>
>> I think the concurrent update case is just a symptom of this. If you
>> try to update a row that's locked for a concurrent update you normally
>> wait until the concurrent update finishes, then follow the ctid chain
>> and recheck the where clause on the target of the link and if it still
>> matches you perform the update there.
>
> Right. EvalPlanQual behavior, in short.
>
>> At least you do that if you have isolation_level set to
>> repeatable_read. If you have isolation level set to serializable then
>> you just fail with a serialization failure. I think that's what you
>> should do if you come across a row that's been updated with a broken
>> ctid chain even in repeatable read mode. Just fail with a
>> serialization failure and document that in partitioned tables if you
>> perform updates that move tuples between partitions then you need to
>> ensure your updates are prepared for serialization failures.
>
> Now, this part I'm not sure about. What's pretty clear is that,
> barring some redesign of the heap format, we can't keep the CTID chain
> intact when the tuple moves to a different relfilenode. What's less
> clear is what to do about that. We can either (1) give up on
> EvalPlanQual behavior in this case and act just as we would if the row
> had been deleted; no update happens.

This is what the current patch has done.
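
To illustrate, here is a minimal sketch (table and column names are
made up, and it assumes the patch is applied, so the cross-partition
UPDATE is executed as a DELETE plus an INSERT):

CREATE TABLE tab (key int, val text) PARTITION BY RANGE (key);
CREATE TABLE tab_p1 PARTITION OF tab FOR VALUES FROM (1) TO (10);
CREATE TABLE tab_p2 PARTITION OF tab FOR VALUES FROM (10) TO (20);
INSERT INTO tab VALUES (5, 'a');

-- Session 1: moves the row from tab_p1 to tab_p2
BEGIN;
UPDATE tab SET key = 15 WHERE key = 5;

-- Session 2: blocks on the row lock; once session 1 commits, the old
-- tuple appears deleted, so this reports UPDATE 0 rather than
-- following the ctid chain the way EvalPlanQual would for a plain
-- UPDATE within the same partition.
UPDATE tab SET val = 'b' WHERE key = 5;

-- Session 1:
COMMIT;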

> or (2) throw a serialization
> error. You're advocating for #2, but I'm not sure that's right,
> because:
>
> 1. It's a lot more work,
>
> 2. Your proposed implementation needs an on-disk format change that
> uses up a scarce infomask bit, and
>
> 3. It's not obvious to me that it's clearly preferable from a user
> experience standpoint. I mean, either way the user doesn't get the
> behavior that they want. Either they're hoping for EPQ semantics and
> they instead do a no-op update, or they're hoping for EPQ semantics
> and they instead get an ERROR. Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.
>
> More opinions welcome.

I am inclined to at least give the user some option to decide the
behaviour. In the future we can even consider support for walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide what the default behaviour should be. My inclination is
more towards erroring out in the unfortunate event where there is a
concurrent UPDATE while the row movement is happening. One option is
to not bother finding out whether the DELETE was part of a partition
row movement or was indeed a DELETE, and to always error out the
UPDATE when heap_update() returns HeapTupleUpdated, but only if the
table is a leaf partition. But this will obviously cause annoyance,
because such errors would then also show up for ordinary concurrent
updates and deletes within the same partition. To mitigate that, we
could keep a table-level option that determines whether to error out
or to silently lose the UPDATE.
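
Purely as an illustration of the table-level option idea (the
reloption name and the error text below are hypothetical, not from the
patch):

-- Hypothetical reloption; this does not exist anywhere today.
ALTER TABLE tab SET (error_on_concurrent_row_move = true);

-- With the option set, the concurrent UPDATE in the earlier sketch
-- would fail with something like:
--   ERROR:  tuple to be updated was moved to another partition
-- rather than silently updating zero rows.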

Another option I was thinking of: when the UPDATE is on a partition
key, acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that SELECTs continue to execute, but UPDATE/DELETE will
wait before opening the table for a scan. An UPDATE of the partition
key is not going to be a very routine operation; it sounds more like a
DBA maintenance operation, so it does not look like it would interfere
with the usual transactions.
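
The intended semantics can be seen with the existing lock modes:
EXCLUSIVE conflicts with ROW EXCLUSIVE (taken by UPDATE and DELETE)
but not with ACCESS SHARE (taken by SELECT). Reusing the made-up
partitions from the earlier sketch:

-- Session 1 (the row movement would take this lock internally):
BEGIN;
LOCK TABLE tab_p1 IN EXCLUSIVE MODE;

-- Session 2:
SELECT * FROM tab_p1 WHERE key = 5;   -- proceeds
UPDATE tab_p1 SET val = 'c';          -- waits until session 1 commits

-- Session 1:
COMMIT;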
