Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mason Sharp <masonlists(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Date: 2023-01-10 01:07:02
Message-ID: 20230110010702.elxqonr3icren3xs@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

I'm a bit worried that this is optimizing the rare case while hurting the
common case. See e.g. my point below about creating additional slots in the
happy path.

It's also not clear that this change is right directionally. If we want to avoid
re-fetching the "original" row version, why don't we provide that
functionality via table_tuple_lock()?
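
Just to illustrate what I mean - a rough sketch, with a purely made-up flag
name (not something that exists today):

	/*
	 * Sketch only: let the caller tell table_tuple_lock() that the outdated
	 * row version has already been fetched into *slot, so the AM can skip
	 * the initial fetch and go straight to chasing the update chain.
	 */
	#define TUPLE_LOCK_FLAG_ALREADY_FETCHED		(1 << 2)	/* hypothetical */

	result = table_tuple_lock(rel, tid, snapshot, slot,
							  cid, LockTupleExclusive, LockWaitBlock,
							  TUPLE_LOCK_FLAG_FIND_LAST_VERSION |
							  TUPLE_LOCK_FLAG_ALREADY_FETCHED,
							  &tmfd);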

On 2023-01-09 13:29:18 +0300, Alexander Korotkov wrote:
> @@ -53,6 +53,12 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
> HeapTuple tuple,
> OffsetNumber tupoffset);
>
> +static TM_Result heapam_tuple_lock_internal(Relation relation, ItemPointer tid,
> + Snapshot snapshot, TupleTableSlot *slot,
> + CommandId cid, LockTupleMode mode,
> + LockWaitPolicy wait_policy, uint8 flags,
> + TM_FailureData *tmfd, bool updated);
> +
> static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
>
> static const TableAmRoutine heapam_methods;
> @@ -299,14 +305,39 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
> static TM_Result
> heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
> Snapshot snapshot, Snapshot crosscheck, bool wait,
> - TM_FailureData *tmfd, bool changingPart)
> + TM_FailureData *tmfd, bool changingPart,
> + TupleTableSlot *lockedSlot)
> {
> + TM_Result result;
> +
> /*
> * Currently Deleting of index tuples are handled at vacuum, in case if
> * the storage itself is cleaning the dead tuples by itself, it is the
> * time to call the index tuple deletion also.
> */
> - return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> + result = heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> +
> + /*
> + * If the tuple has been concurrently updated, get lock already so that on
> + * retry it will succeed, provided that the caller asked to do this by
> + * providing a lockedSlot.
> + */
> + if (result == TM_Updated && lockedSlot != NULL)
> + {
> + result = heapam_tuple_lock_internal(relation, tid, snapshot,
> + lockedSlot, cid, LockTupleExclusive,
> + LockWaitBlock,
> + TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> + tmfd, true);

You're ignoring the 'wait' parameter here, no? I think the modification to
heapam_tuple_update() has the same issue.
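
I'd expect the wait policy to be derived from 'wait', roughly along these
lines - whether LockWaitSkip is actually the right mapping for !wait, or
whether that case needs different handling, I've not checked:

		result = heapam_tuple_lock_internal(relation, tid, snapshot,
											lockedSlot, cid, LockTupleExclusive,
											wait ? LockWaitBlock : LockWaitSkip,
											TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
											tmfd, true);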

> + if (result == TM_Ok)
> + {
> + tmfd->traversed = true;
> + return TM_Updated;
> + }
> + }
> +
> + return result;

Doesn't this mean that the caller can't easily distinguish between
heapam_tuple_delete() and heapam_tuple_lock_internal() returning a failure
state?

> @@ -350,213 +402,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
> LockWaitPolicy wait_policy, uint8 flags,
> TM_FailureData *tmfd)
> {

Moving the entire body of the function around makes it harder to review
this change, because the code movement is intermingled with "actual" changes.

> +/*
> + * This routine does the work for heapam_tuple_lock(), but also support
> + * `updated` to re-use the work done by heapam_tuple_update() or
> + * heapam_tuple_delete() on fetching tuple and checking its visibility.
> + */
> +static TM_Result
> +heapam_tuple_lock_internal(Relation relation, ItemPointer tid, Snapshot snapshot,
> + TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
> + LockWaitPolicy wait_policy, uint8 flags,
> + TM_FailureData *tmfd, bool updated)
> +{
> + BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
> + TM_Result result;
> + Buffer buffer = InvalidBuffer;
> + HeapTuple tuple = &bslot->base.tupdata;
> + bool follow_updates;
> +
> + follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
> + tmfd->traversed = false;
> +
> + Assert(TTS_IS_BUFFERTUPLE(slot));
> +
> +tuple_lock_retry:
> + tuple->t_self = *tid;
> + if (!updated)
> + result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> + follow_updates, &buffer, tmfd);
> + else
> + result = TM_Updated;

To make sure I understand: You're basically trying to have
heapam_tuple_lock_internal() work as before, except that you want to omit
fetching the first row version, assuming that the caller already tried to lock
it?

I think at the very least this needs an assert verifying that the slot actually
contains a tuple in the "updated" path.
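
Something like this in the else branch (sketch):

	else
	{
		/* caller must already have fetched the old row version into the slot */
		Assert(!TTS_EMPTY(slot));
		result = TM_Updated;
	}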

> + if (result == TM_Updated &&
> + (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
> + {
> + if (!updated)
> + {
> + /* Should not encounter speculative tuple on recheck */
> + Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));

Why shouldn't this be checked in the updated case as well?

> @@ -1490,7 +1492,16 @@ ExecDelete(ModifyTableContext *context,
> * transaction-snapshot mode transactions.
> */
> ldelete:
> - result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
> +
> + /*
> + * Ask ExecDeleteAct() to immediately place the lock on the updated
> + * tuple if we will need EvalPlanQual() in that case to handle it.
> + */
> + if (!IsolationUsesXactSnapshot())
> + slot = ExecGetReturningSlot(estate, resultRelInfo);
> +
> + result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
> + slot);

I don't like that 'slot' is now used for multiple things. I think this could
best be addressed by simply moving the slot variable inside the blocks using
it. And here it should be named more accurately.

Is there a potential conflict with other uses of ExecGetReturningSlot()?

Given that we now always create the slot, doesn't this increase the overhead
for the very common case of not needing EPQ? We'll create unnecessary slots
all the time, no?
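
The naming / scoping part could look roughly like this (names purely
illustrative, and it obviously doesn't answer the overhead question):

		TupleTableSlot *lockedSlot = NULL;

		/*
		 * Only ask ExecDeleteAct() to lock the updated tuple if we'd actually
		 * use EvalPlanQual() to handle a concurrent update.
		 */
		if (!IsolationUsesXactSnapshot())
			lockedSlot = ExecGetReturningSlot(estate, resultRelInfo);

		result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
							   lockedSlot);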

> */
> EvalPlanQualBegin(context->epqstate);
> inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
> resultRelInfo->ri_RangeTableIndex);
> + ExecCopySlot(inputslot, slot);

> - result = table_tuple_lock(resultRelationDesc, tupleid,
> - estate->es_snapshot,
> - inputslot, estate->es_output_cid,
> - LockTupleExclusive, LockWaitBlock,
> - TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> - &context->tmfd);
> + Assert(context->tmfd.traversed);
> + epqslot = EvalPlanQual(context->epqstate,
> + resultRelationDesc,
> + resultRelInfo->ri_RangeTableIndex,
> + inputslot);

The only point of using EvalPlanQualSlot() is to avoid copying the tuple from
one slot to another. Given that we're not benefiting from that anymore (due to
your manual ExecCopySlot() call), it seems we could just pass 'slot' to
EvalPlanQual() and not bother with EvalPlanQualSlot().
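
I.e. something like (sketch):

			Assert(context->tmfd.traversed);
			epqslot = EvalPlanQual(context->epqstate,
								   resultRelationDesc,
								   resultRelInfo->ri_RangeTableIndex,
								   slot);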

> @@ -1449,6 +1451,8 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
> * tmfd - filled in failure cases (see below)
> * changingPart - true iff the tuple is being moved to another partition
> * table due to an update of the partition key. Otherwise, false.
> + * lockedSlot - slot to save the locked tuple if should lock the last row
> + * version during the concurrent update. NULL if not needed.

The grammar in the new comments is off ("if should lock").

I think this also needs to mention that this *significantly* changes the
behaviour of table_tuple_delete(). That's not at all clear from the comment.
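
Perhaps something along these lines (just a wording sketch):

 * lockedSlot - if not NULL and the row was concurrently updated, the latest
 *		row version is locked and returned in this slot, tmfd->traversed is
 *		set, and TM_Updated is returned - i.e. the function then also does
 *		the work of table_tuple_lock(). Pass NULL to get the old behaviour.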

Greetings,

Andres Freund
