Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, amul sul <sulamul(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key
Date: 2018-04-05 01:44:39
Message-ID: 20180405014439.fbezvbjrmcw64vjc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

Hi,

On 2018-04-02 11:26:38 -0400, Robert Haas wrote:
> On Wed, Mar 28, 2018 at 2:12 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > How will it break it? They'll see an invalid ctid and conclude that the
> > tuple is dead? Without any changes that's already something that can
> > happen if a later tuple in the chain has been pruned away. Sure, that
> > code won't realize it should error out because the tuple is now in a
> > different partition, but neither would a infomask bit.
> >
> > I think my big problem is that I just don't see what the worst that can
> > happen is. We'd potentially see a broken ctid chain, something that very
> > commonly happens, and consider the tuple to be invisible. That seems
> > pretty sane behaviour for unadapted code, and not any worse than other
> > potential solutions.
>
> This is more or less my feeling as well. I think it's better to
> conserve our limited supply of infomask bits as much as we can, and I
> do think that we should try to reclaimed HEAP_MOVED_IN and
> HEAP_MOVED_OFF in the future instead of defining the combination of
> the two of them to mean something now.

Yep.

It'd also make locking more complicated or require to keep more
information around in HeapUpdateFailureData. In a number of places we
currently release the buffer pin before switching over heap_lock_tuple
etc results, or there's not even a way to get at the infomask currently
(heap_update failing).

> Modulo implementation quality, I think the risk
> level of this patch is somewhat but not vastly higher than
> 37484ad2aacef5ec794f4dd3d5cf814475180a78, which similarly defined a
> previously-unused bit pattern in the tuple header.

Personally I think that change was vastly riskier, because it affected
freezing and wraparounds. Which is something we've repeatedly gotten
wrong.

> The reason I think this one might be somewhat riskier is because
> AFAICS it's not so easy to make sure we've found all the code, even in
> core, that might care, as it was in that case; and also because
> updates happen more than freezing.

Butthe consequences of not catching a changed piece of code are fairly
harmless. And I'd say things that happen more often are actually easier
to validate than something that with default settings requires hours of
testing...

I've attached a noticeably editorialized patch:

- I'm uncomfortable with the "moved" information not being crash-safe /
replicated. Thus I added a new flag to preserve it, and removed the
masking of the moved bit in the ctid from heap_mask().

- renamed macros to not mention valid / invalid block numbers, but
rather
HeapTupleHeaderSetMovedPartitions / HeapTupleHeaderIndicatesMovedPartitions
and
ItemPointerSetMovedPartitions / ItemPointerIndicatesMovedPartitions

I'm not wedded to these names, but I'l be adamant they they're not
talking about invalid block numbers. Makes code harder to understand
imo.

- removed new assertion from heap_get_latest_tid(), it's wrong for the
case where all row versions are invisible.

- editorialized comments a bit

- added a few more assertions

I went through the existing code to make sure that
a) no checks where missed
b) to evaluate what the consequences when chasing chains would be
c) to evaluate what the consequences when we miss erroring out

WRT b), it's usually just superflous extra work if the new checks
weren't there. I went through all callers accessing xmax (via GetRawXmax
and GetUpdateXid):

b)
- heap rewrites will keep a tuple in hashtable till end of run, then
reset the ctid to self. No real corruption, but we'd not detect
further errors when attempting to follow chain.
- EvalPlanQualFetch would fail to abort loop, attempt to fetch
tuple. This'll extend the relation by a single page, because P_NEW ==
InvalidBlockNumber.
- heap_prune_chain - no changes needed (delete isn't recursed through)
- heap_get_root_tuples - same
- heap_hot_search_buffer - only continues over hot updates
- heap_lock_tuple (and subsidiary routines) - same as EvalPlanQualFetch,
would then return HeapTupleUpdated.

c)
- GetTupleForTrigger - the proper error wouldn't be raised, instead a
NULL tuple would be passed to the trigger
- EvalPlanQualFetch - a NULL tuple would be returned after the
consequences above
- RelationFindReplTupleBy* - wrong error message
- ExecLockRows - no error would be raised, continue normally
- ExecDelete() - tuple ignored without error
- ExecUpdate() - same

Questions:

- I'm not perfectly happy with
"tuple to be locked was already moved to another partition due to concurrent update"
as the error message. If somebody has a better suggestions.
- should heap_get_latest_tid() error out when the chain ends in a moved
tuple? I personally think it doesn't matter much, the functionality
is so bonkers and underspecified that it doesn't matter anyway ;)
- I'm not that happy with the number of added spec test files with
number postfixes. Can't we combine them into a single file?
- as remarked elsewhere on this thread, I think the used errcode should
be a serialization failure

Greetings,

Andres Freund

Attachment Content-Type Size
v8-0001-Raise-error-when-affecting-tuple-moved-into-diffe.patch text/x-diff 42.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-04-05 02:01:46 Re: [HACKERS] Runtime Partition Pruning
Previous Message Michael Paquier 2018-04-05 01:38:18 Re: BUG #14999: pg_rewind corrupts control file global/pg_control