Re: POC: Lock updated tuples in tuple_update() and tuple_delete()

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mason Sharp <masonlists(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>
Subject: Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Date: 2023-03-01 14:57:45
Message-ID: CAPpHfdtwKb5UVXKkDQZYW8nQCODy0fY_S7mV5Z+cg7urL=zDEA@mail.gmail.com
Lists: pgsql-hackers

Hi, Andres.

Thank you for your review. Sorry for the late reply. It took me some
time to figure out how to revise the patch.

The revised patchset is attached. I decided to split the patch into two:
1) Avoid re-fetching the "original" row version during update and delete.
2) Save effort by re-using the existing context of
tuple_update()/tuple_delete() when locking the tuple.
These are two separate optimizations, so let's evaluate their
performance separately.

On Tue, Jan 10, 2023 at 4:07 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> I'm a bit worried that this is optimizing the rare case while hurting the
> common case. See e.g. my point below about creating additional slots in the
> happy path.

This makes sense. It is worth allocating the slot only if we're going
to store a tuple in it. I implemented this by passing a callback for
slot allocation instead of the slot itself.
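
Roughly, it now looks like this (a sketch with illustrative names, not
the exact patch code):

    typedef TupleTableSlot *(*GetSlotCallback) (void *arg);

    static TM_Result
    heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
                        Snapshot snapshot, Snapshot crosscheck, bool wait,
                        TM_FailureData *tmfd, bool changingPart,
                        GetSlotCallback lockedSlotCb, void *cbArg)
    {
        TM_Result   result;

        result = heap_delete(relation, tid, cid, crosscheck, wait,
                             tmfd, changingPart);

        /*
         * No callback given, or no concurrent update: the happy path pays
         * nothing for the new capability.
         */
        if (result != TM_Updated || lockedSlotCb == NULL)
            return result;

        /* Conflict path: only now materialize a slot and lock the tuple. */
        result = heapam_tuple_lock_internal(relation, tid, snapshot,
                                            lockedSlotCb(cbArg), cid,
                                            LockTupleExclusive, LockWaitBlock,
                                            TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
                                            tmfd, true);
        if (result == TM_Ok)
        {
            tmfd->traversed = true;
            return TM_Updated;
        }
        return result;
    }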

> It's also not clear that change is right directionally. If we want to avoid
> re-fetching the "original" row version, why don't we provide that
> functionality via table_tuple_lock()?

These are two distinct optimizations; they now come as two distinct patches.

> On 2023-01-09 13:29:18 +0300, Alexander Korotkov wrote:
> > @@ -53,6 +53,12 @@ static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
> > HeapTuple tuple,
> > OffsetNumber tupoffset);
> >
> > +static TM_Result heapam_tuple_lock_internal(Relation relation, ItemPointer tid,
> > + Snapshot snapshot, TupleTableSlot *slot,
> > + CommandId cid, LockTupleMode mode,
> > + LockWaitPolicy wait_policy, uint8 flags,
> > + TM_FailureData *tmfd, bool updated);
> > +
> > static BlockNumber heapam_scan_get_blocks_done(HeapScanDesc hscan);
> >
> > static const TableAmRoutine heapam_methods;
> > @@ -299,14 +305,39 @@ heapam_tuple_complete_speculative(Relation relation, TupleTableSlot *slot,
> > static TM_Result
> > heapam_tuple_delete(Relation relation, ItemPointer tid, CommandId cid,
> > Snapshot snapshot, Snapshot crosscheck, bool wait,
> > - TM_FailureData *tmfd, bool changingPart)
> > + TM_FailureData *tmfd, bool changingPart,
> > + TupleTableSlot *lockedSlot)
> > {
> > + TM_Result result;
> > +
> > /*
> > * Currently Deleting of index tuples are handled at vacuum, in case if
> > * the storage itself is cleaning the dead tuples by itself, it is the
> > * time to call the index tuple deletion also.
> > */
> > - return heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> > + result = heap_delete(relation, tid, cid, crosscheck, wait, tmfd, changingPart);
> > +
> > + /*
> > + * If the tuple has been concurrently updated, get lock already so that on
> > + * retry it will succeed, provided that the caller asked to do this by
> > + * providing a lockedSlot.
> > + */
> > + if (result == TM_Updated && lockedSlot != NULL)
> > + {
> > + result = heapam_tuple_lock_internal(relation, tid, snapshot,
> > + lockedSlot, cid, LockTupleExclusive,
> > + LockWaitBlock,
> > + TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> > + tmfd, true);
>
> You're ignoring the 'wait' parameter here, no? I think the modification to
> heapam_tuple_update() has the same issue.

Yep. I didn't catch this because currently we only call
tuple_update()/tuple_delete() with wait == true. Fixed.
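
For the record, a minimal sketch of how 'wait' can be threaded through
on the lock path (the wait_policy mapping is my shorthand here:
LockWaitSkip reports TM_WouldBlock, which matches what heap_delete()
does when wait == false):

    /* derive the lock wait policy from the caller's 'wait' flag */
    result = heapam_tuple_lock_internal(relation, tid, snapshot,
                                        lockedSlot, cid, LockTupleExclusive,
                                        wait ? LockWaitBlock : LockWaitSkip,
                                        TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
                                        tmfd, true);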

> > + if (result == TM_Ok)
> > + {
> > + tmfd->traversed = true;
> > + return TM_Updated;
> > + }
> > + }
> > +
> > + return result;
>
> Doesn't this mean that the caller can't easily distinguish between
> heapam_tuple_delete() and heapam_tuple_lock_internal() returning a failure
> state?

Exactly. But currently nodeModifyTable.c handles these failure states
in a similar way, and I don't see why that should be different in the
future.
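
Both failure sources funnel into the same switch there; condensed (not
verbatim) it is:

    switch (result)
    {
        case TM_Ok:
            break;

        case TM_Updated:
            /* Taken whether heap_delete() or the subsequent lock attempt
             * reported the concurrent update; EvalPlanQual recheck follows. */
            break;

        case TM_Deleted:
            /* The row was concurrently deleted; just skip it. */
            break;

        default:
            elog(ERROR, "unrecognized table_tuple_delete status: %u", result);
    }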

> > @@ -350,213 +402,8 @@ heapam_tuple_lock(Relation relation, ItemPointer tid, Snapshot snapshot,
> > LockWaitPolicy wait_policy, uint8 flags,
> > TM_FailureData *tmfd)
> > {
>
> Moving the entire body of the function around, makes it harder to review
> this change, because the code movement is intermingled with "actual" changes.

OK, fixed.

> > +/*
> > + * This routine does the work for heapam_tuple_lock(), but also support
> > + * `updated` to re-use the work done by heapam_tuple_update() or
> > + * heapam_tuple_delete() on fetching tuple and checking its visibility.
> > + */
> > +static TM_Result
> > +heapam_tuple_lock_internal(Relation relation, ItemPointer tid, Snapshot snapshot,
> > + TupleTableSlot *slot, CommandId cid, LockTupleMode mode,
> > + LockWaitPolicy wait_policy, uint8 flags,
> > + TM_FailureData *tmfd, bool updated)
> > +{
> > + BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
> > + TM_Result result;
> > + Buffer buffer = InvalidBuffer;
> > + HeapTuple tuple = &bslot->base.tupdata;
> > + bool follow_updates;
> > +
> > + follow_updates = (flags & TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS) != 0;
> > + tmfd->traversed = false;
> > +
> > + Assert(TTS_IS_BUFFERTUPLE(slot));
> > +
> > +tuple_lock_retry:
> > + tuple->t_self = *tid;
> > + if (!updated)
> > + result = heap_lock_tuple(relation, tuple, cid, mode, wait_policy,
> > + follow_updates, &buffer, tmfd);
> > + else
> > + result = TM_Updated;
>
> To make sure I understand: You're basically trying to have
> heapam_tuple_lock_internal() work as before, except that you want to omit
> fetching the first row version, assuming that the caller already tried to lock
> it?
>
> I think at the very this needs an assert verifying that the slot actually
> contains a tuple in the "updated" path.

This part was re-written.

> > + if (result == TM_Updated &&
> > + (flags & TUPLE_LOCK_FLAG_FIND_LAST_VERSION))
> > + {
> > + if (!updated)
> > + {
> > + /* Should not encounter speculative tuple on recheck */
> > + Assert(!HeapTupleHeaderIsSpeculative(tuple->t_data));
>
> Why shouldn't this be checked in the updated case as well?
>
>
> > @@ -1490,7 +1492,16 @@ ExecDelete(ModifyTableContext *context,
> > * transaction-snapshot mode transactions.
> > */
> > ldelete:
> > - result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart);
> > +
> > + /*
> > + * Ask ExecDeleteAct() to immediately place the lock on the updated
> > + * tuple if we will need EvalPlanQual() in that case to handle it.
> > + */
> > + if (!IsolationUsesXactSnapshot())
> > + slot = ExecGetReturningSlot(estate, resultRelInfo);
> > +
> > + result = ExecDeleteAct(context, resultRelInfo, tupleid, changingPart,
> > + slot);
>
> I don't like that 'slot' is now used for multiple things. I think this could
> best be addressed by simply moving the slot variable inside the blocks using
> it. And here it should be named more accurately.

I didn't do that refactoring. But the edits introduced by the 1st
patch of the set are now more granular and don't affect the usage of
the 'slot' variable.

> Is there a potential conflict with other uses of the ExecGetReturningSlot()?

Yep. The current revision avoids this ad-hoc usage of slots.

> Given that we now always create the slot, doesn't this increase the overhead
> for the very common case of not needing EPQ? We'll create unnecessary slots
> all the time, no?

Yes, this is addressed by allocating the EPQ slot via a callback only
once it is needed. I'm thinking about wrapping this into an
abstraction called 'LazySlot'.
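
Something along these lines (just a draft shape, nothing is settled):

    /* a slot that is only created on first use */
    typedef struct LazyTupleTableSlot
    {
        TupleTableSlot *slot;                   /* NULL until requested */
        TupleTableSlot *(*getslot) (void *arg); /* creates the slot */
        void           *arg;                    /* passed to getslot */
    } LazyTupleTableSlot;

    /* evaluate lazily: materialize the slot only when actually needed */
    #define LAZY_TTS_EVAL(lazySlot) \
        ((lazySlot)->slot != NULL ? \
         (lazySlot)->slot : \
         ((lazySlot)->slot = (lazySlot)->getslot((lazySlot)->arg)))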

> > */
> > EvalPlanQualBegin(context->epqstate);
> > inputslot = EvalPlanQualSlot(context->epqstate, resultRelationDesc,
> > resultRelInfo->ri_RangeTableIndex);
> > + ExecCopySlot(inputslot, slot);
>
> > - result = table_tuple_lock(resultRelationDesc, tupleid,
> > - estate->es_snapshot,
> > - inputslot, estate->es_output_cid,
> > - LockTupleExclusive, LockWaitBlock,
> > - TUPLE_LOCK_FLAG_FIND_LAST_VERSION,
> > - &context->tmfd);
> > + Assert(context->tmfd.traversed);
> > + epqslot = EvalPlanQual(context->epqstate,
> > + resultRelationDesc,
> > + resultRelInfo->ri_RangeTableIndex,
> > + inputslot);
>
> The only point of using EvalPlanQualSlot() is to avoid copying the tuple from
> one slot to another. Given that we're not benefiting from that anymore (due to
> your manual ExecCopySlot() call), it seems we could just pass 'slot' to
> EvalPlanQual() and not bother with EvalPlanQualSlot().

This makes sense. Now the usage pattern of the slots is clearer.
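
So the EPQ invocation boils down to this sketch:

    /* The lock path already left the last row version in 'slot';
     * feed it straight to EvalPlanQual(), no extra copy needed. */
    Assert(context->tmfd.traversed);
    epqslot = EvalPlanQual(context->epqstate,
                           resultRelationDesc,
                           resultRelInfo->ri_RangeTableIndex,
                           slot);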

> > @@ -1449,6 +1451,8 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
> > * tmfd - filled in failure cases (see below)
> > * changingPart - true iff the tuple is being moved to another partition
> > * table due to an update of the partition key. Otherwise, false.
> > + * lockedSlot - slot to save the locked tuple if should lock the last row
> > + * version during the concurrent update. NULL if not needed.
>
> The grammar in the new comments is off ("if should lock").
>
> I think this is also needs to mention that this *significantly* changes the
> behaviour of table_tuple_delete(). That's not at all clear from the comment.

Let's see the performance results for the patchset. I'll properly
revise the comments if the results are good.

Pavel, could you please re-run your tests on the revised patchset?

------
Regards,
Alexander Korotkov

Attachment Content-Type Size
0001-Evade-extra-table_tuple_fetch_row_version-in-Exe-v11.patch application/octet-stream 3.4 KB
0002-Allow-locking-updated-tuples-in-tuple_update-and-v11.patch application/octet-stream 33.7 KB
