Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Chris Travers <chris(dot)travers(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Date: 2023-03-23 00:30:03
Message-ID: 20230323003003.plgaxjqahjgkuxrk@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-03-21 01:25:11 +0300, Alexander Korotkov wrote:
> I'm going to push patchset v15 if no objections.

Just saw that this went in - didn't catch up with the thread before,
unfortunately. At the very least I'd like to see some more work on cleaning up
the lazy tuple slot stuff. It's replete with unnecessary multiple-evaluation
hazards - I realize that there's some of those already, but I don't think we
should go further down that route. As far as I can tell there's no need for
any of this to be macros.
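
To illustrate the kind of hazard I mean - hypothetical names, the struct
member and helper below are made up, not the committed code:

/*
 * A macro like this expands 'lslot' more than once, so e.g.
 * LAZY_SLOT_GET(next_lazy_slot(ctx)) ends up calling next_lazy_slot()
 * multiple times:
 */
#define LAZY_SLOT_GET(lslot) \
	((lslot)->slot != NULL ? (lslot)->slot : lazy_slot_init(lslot))

/*
 * whereas a static inline function evaluates its argument exactly once
 * and costs the same after inlining:
 */
static inline TupleTableSlot *
lazy_slot_get(LazyTupleTableSlot *lslot)
{
	if (lslot->slot == NULL)
		return lazy_slot_init(lslot);
	return lslot->slot;
}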

> From a8b4e8a7b27815e013ea07b8cc9ac68541a9ac07 Mon Sep 17 00:00:00 2001
> From: Alexander Korotkov <akorotkov(at)postgresql(dot)org>
> Date: Tue, 21 Mar 2023 00:34:15 +0300
> Subject: [PATCH 1/2] Evade extra table_tuple_fetch_row_version() in
> ExecUpdate()/ExecDelete()
>
> When we lock a tuple using table_tuple_lock(), we fetch the locked tuple
> into the slot at the same time. In that case we can skip the extra
> table_tuple_fetch_row_version(), since we've already fetched the 'old' tuple
> and nobody can change it concurrently while it's locked.
>
> Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
> Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
> Reviewed-by: Andres Freund, Chris Travers
> ---
> src/backend/executor/nodeModifyTable.c | 48 +++++++++++++++++++-------
> 1 file changed, 35 insertions(+), 13 deletions(-)
>
> diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
> index 3a673895082..93ebfdbb0d8 100644
> --- a/src/backend/executor/nodeModifyTable.c
> +++ b/src/backend/executor/nodeModifyTable.c
> @@ -1559,6 +1559,22 @@ ldelete:
> {
> case TM_Ok:
> Assert(context->tmfd.traversed);
> +
> + /*
> + * Save locked tuple for further processing of
> + * RETURNING clause.
> + */
> + if (processReturning &&
> + resultRelInfo->ri_projectReturning &&
> + !resultRelInfo->ri_FdwRoutine)
> + {
> + TupleTableSlot *returningSlot;
> +
> + returningSlot = ExecGetReturningSlot(estate, resultRelInfo);
> + ExecCopySlot(returningSlot, inputslot);
> + ExecMaterializeSlot(returningSlot);
> + }
> +
> epqslot = EvalPlanQual(context->epqstate,
> resultRelationDesc,
> resultRelInfo->ri_RangeTableIndex,

This seems a bit byzantine. We use inputslot = EvalPlanQualSlot(...) to make
EvalPlanQual() a bit cheaper, because that avoids a slot copy inside
EvalPlanQual(). But now we copy and materialize that slot anyway - and we do
so even if EPQ fails. And afaics we also do it when epqreturnslot is set, in
which case the copied slot is never used.

Read the next paragraph below before replying to the above - I don't think
this is right for other reasons:

> @@ -1673,12 +1689,17 @@ ldelete:
> }
> else
> {
> + /*
> + * The tuple may already have been fetched into the returning slot if
> + * we previously locked it. Fetch the tuple only if the slot is
> + * empty.
> + */
> slot = ExecGetReturningSlot(estate, resultRelInfo);
> if (oldtuple != NULL)
> {
> ExecForceStoreHeapTuple(oldtuple, slot, false);
> }
> - else
> + else if (TupIsNull(slot))
> {
> if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
> SnapshotAny, slot))

I don't think this is correct as-is - what if ExecDelete() is called with some
older tuple in the returning slot? If we don't enter the TM_Updated path, it
won't get updated, and we'll return the wrong tuple. It certainly looks
possible to me - consider what happens if a first tuple enters the TM_Updated
path but then fails EvalPlanQual(). If a second tuple is then deleted without
entering the TM_Updated path, the wrong tuple will be used for RETURNING.

<plays around with isolationtester>

Yes, indeed. The attached isolationtest breaks with 764da7710bf.

I think it's entirely sensible to avoid the tuple fetching in ExecDelete(),
but it needs a bit less localized work. Instead of using the presence of a
tuple in the returning slot, ExecDelete() should track whether it already has
fetched the deleted tuple.
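
A minimal sketch of that - 'old_tuple_copied' is a made-up name and the
fragments are untested, just to show the shape:

bool		old_tuple_copied = false;	/* TM_Updated path already copied
										 * the old tuple into the
										 * returning slot? */
...
	case TM_Ok:
		Assert(context->tmfd.traversed);
		/* copy the locked tuple into the returning slot, as now */
		old_tuple_copied = true;
		...
...
	/* RETURNING: trust the flag, not whether the slot happens to be empty */
	slot = ExecGetReturningSlot(estate, resultRelInfo);
	if (oldtuple != NULL)
		ExecForceStoreHeapTuple(oldtuple, slot, false);
	else if (!old_tuple_copied)
	{
		if (!table_tuple_fetch_row_version(resultRelationDesc, tupleid,
										   SnapshotAny, slot))
			elog(ERROR, "failed to fetch deleted tuple for DELETE RETURNING");
	}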

Or alternatively, do the work to avoid refetching the tuple for the much more
common case of not needing EPQ at all.

I guess this really is part of my issue with this change - it optimizes the
rare case, while not addressing the same inefficiency in the common case.
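
One direction for the common case (a sketch, not a worked-out tableam
proposal): give the AM delete callback an optional output slot, so that on
TM_Ok it can hand back the old row version itself and ExecDelete() never
needs a separate table_tuple_fetch_row_version() for RETURNING.
Hypothetical signature:

static TM_Result
heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
					Snapshot snapshot, Snapshot crosscheck, bool wait,
					TM_FailureData *tmfd, bool changingPart,
					TupleTableSlot *oldSlot);
/* on TM_Ok, fill 'oldSlot' (if provided) with the deleted row version */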

> @@ -299,14 +305,46 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
> static TM_Result
> heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
> Snapshot snapshot, Snapshot crosscheck, bool wait,
> - TM_FailureData *tmfd, bool changingPart)
> + TM_FailureData *tmfd, bool changingPart,
> + LazyTupleTableSlot *lockedSlot)
> {
> + TM_Result result;
> +
> /*
> * Currently Deleting of index tuples are handled at vacuum, in case if
> * the storage itself is cleaning the dead tuples by itself, it is the
> * time to call the index tuple deletion also.
> */
> - return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> + result = heap_delete(relation, tid, cid, crosscheck, wait,
> + tmfd, changingPart);
> +
> + /*
> + * If the tuple has been concurrently updated, then get the lock on it.
> + * (Do this if the caller asked for that by providing a 'lockedSlot'.)
> + * With the lock held, retrying the delete should succeed even if there
> + * are more concurrent update attempts.
> + */
> + if (result == TM_Updated && lockedSlot)
> + {
> + TupleTableSlot *evalSlot;
> +
> + Assert(wait);
> +
> + evalSlot = LAZY_TTS_EVAL(lockedSlot);
> + result = heapam_tuple_lock_internal(relation, tid, snapshot,
> + evalSlot, cid, LockTupleExclusive,
> + LockWaitBlock,
> + TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> + tmfd, true);

Oh, huh? As I mentioned before, the unconditional use of LockWaitBlock means
the wait parameter is ignored.
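
For illustration, one possible way to honor it - a sketch, untested:

	/* only take the lock-and-retry shortcut when the caller wants to wait */
	if (result == TM_Updated && lockedSlot && wait)
	{
		...
	}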

I'm frankly getting annoyed here.

> +/*
> + * This routine does the work for heapam_tuple_lock(), but also supports an
> + * `updated` argument to reuse the work done by heapam_tuple_update() or
> + * heapam_tuple_delete() in figuring out that the tuple was concurrently
> + * updated.
> + */
> +static TM_Result
> +heapam_tuple_lock_internal(Relation relation, ItemPointer tid,
> + Snapshot snapshot, TupleTableSlot *slot,
> + CommandId cid, LockTupleMode mode,
> + LockWaitPolicy wait_policy, uint8 flags,
> + TM_FailureData *tmfd, bool updated)

Why is the new parameter named 'updated'?

> {
> BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
> TM_Result result;
> - Buffer buffer;
> + Buffer buffer = InvalidBuffer;
> HeapTuple tuple = &bslot->base.tupdata;
> bool follow_updates;
>
> @@ -374,16 +455,26 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
>
> tuple_lock_retry:
> tuple->t_self = *tid;
> - result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> - follow_updates, &buffer, tmfd);
> + if (!updated)
> + result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> + follow_updates, &buffer, tmfd);
> + else
> + result = TM_Updated;
>
> if (result == TM_Updated &&
> (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
> {
> - /* Should not encounter speculative tuple on recheck */
> - Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
> + if (!updated)
> + {
> + /* Should not encounter speculative tuple on recheck */
> + Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));

Hm, why is it ok to encounter speculative tuples in the updated case? Oh, I
guess you got failures because the slot doesn't point anywhere at this point.

> - ReleaseBuffer(buffer);
> + ReleaseBuffer(buffer);
> + }
> + else
> + {
> + updated = false;
> + }
>
> if (!ItemPointerEquals(&tmfd->ctid, &tuple->t_self))
> {

Which means this is completely bogus now?

HeapTuple tuple = &bslot->base.tupdata;

In the first iteration this just points into the newly created slot, which
doesn't have a tuple stored in it yet. So the above checks some uninitialized
memory.

Giving up at this point.

This doesn't seem ready to have been committed.

Greetings,

Andres Freund

Attachment Content-Type Size
0001-epq-delete-returning-spec-test.patch text/x-diff 3.1 KB
