Re: Undo logs

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Dilip Kumar <dilip(dot)kumar(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Undo logs
Date: 2018-12-04 09:30:44
Message-ID: CAFiTN-s112o6KLi_gVX=b-vO9t_2_ufxxRxG6Ff1iDx5Knvohw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > 10.
> > > + if (!UndoRecPtrIsValid(multi_prep_urp))
> > > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
> > > + else
> > > + urecptr = multi_prep_urp;
> > > +
> > > + size = UndoRecordExpectedSize(urec);
> > > ..
> > > ..
> > > + if (UndoRecPtrIsValid(multi_prep_urp))
> > > + {
> > > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
> > > + insert = UndoLogOffsetPlusUsableBytes(insert, size);
> > > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
> > > + }
> > >
> > > Can't we use urecptr instead of multi_prep_urp in above code after
> > > urecptr is initialized?
> >
> > I think we can't because urecptr is just the current pointer we are
> > going to return but multi_prep_urp is the static variable we need to
> > update so that
> > the next prepare can calculate urecptr from this location.
> > >
>
> Okay, but that was not apparent from the code, so added a comment in
> the attached delta patch. BTW, wouldn't it be better to move this
> code to the end of function once prepare for current record is
> complete.
>
> More comments
> ----------------------------
> 1.
> * We can consider that the log as switched if
>
> /that/ needs to be removed.
>
> 2.
> + if (prepare_idx == max_prepared_undo)
> + elog(ERROR, "Already reached the maximum prepared limit.");
>
> a. /Already/already
> b. we don't use full-stop (.) in error

Merged your change
>
> 3.
> + * also stores the dbid and the progress of the undo apply during rollback.
>
> /the progress/ extra space.
>
Merged your change
> 4.
> +UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
> + TransactionId xid, UndoPersistence upersistence)
> +{
>
> nrecords should be a second parameter.
Merged your change
>
> 5.
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
> + TransactionId xid)
>
> It seems better to have xid parameter before UndoPersistence.
Merged your change. Done same changes in UndoRecordAllocate as well
>
> 6.
> + /* FIXME: Should we just report error ? */
> + Assert(index < MAX_BUFFER_PER_UNDO);
>
> No need of this Fixme.
Merged your change
>
> 7.
> PrepareUndoInsert()
> {
> ..
> do
> {
> ..
> + /* Undo record can not fit into this block so go to the next block. */
> + cur_blk++;
> ..
> } while (cur_size < size);
> ..
> }
>
> This comment was making me uneasy, so slightly adjusted the code.
> Basically, by that time it was not decided whether undo record can fit
> in current buffer or not.
Merged your change
>
> 8.
> + /*
> + * Save referenced of undo record pointer as well as undo record.
> + * InsertPreparedUndo will use these to insert the prepared record.
> + */
> + prepared_undo[prepare_idx].urec = urec;
> + prepared_undo[prepare_idx].urp = urecptr;
>
> Slightly adjust the above comment.
Merged your change
>
> 9.
> +InsertPreparedUndo(void)
> {
> ..
> + Assert(prepare_idx > 0);
> +
> + /* This must be called under a critical section. */
> + Assert(InRecovery || CritSectionCount > 0);
> ..
> }
>
> I have added a few more comments for above assertions, see if those are correct.

Merged your change
>
> 10.
> +InsertPreparedUndo(void)
> {
> ..
> + prev_undolen = undo_len;
> +
> + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
> ..
> }
>
> There is no need to use an additional variable prev_undolen in the
> above code. I have modified the code to remove it's usage, check if
> that is correct.

looks fine to me.
>
> 11.
> + * Fetch the next undo record for given blkno, offset and transaction id (if
> + * valid). We need to match transaction id along with block number and offset
> + * because in some cases (like reuse of slot for committed transaction), we
> + * need to skip the record if it is modified by a transaction later than the
> + * transaction indicated by previous undo record. For example, consider a
> + * case where tuple (ctid - 0,1) is modified by transaction id 500 which
> + * belongs to transaction slot 0. Then, the same tuple is modified by
> + * transaction id 501 which belongs to transaction slot 1. Then, both the
> + * transaction slots are marked for reuse. Then, again the same tuple is
> + * modified by transaction id 502 which has used slot 0. Now, some
> + * transaction which has started before transaction 500 wants to traverse the
> + * chain to find visible tuple will keep on rotating infinitely between undo
> + * tuple written by 502 and 501. In such a case, we need to skip the undo
> + * tuple written by transaction 502 when we want to find the undo record
> + * indicated by the previous pointer of undo tuple written by transaction 501.
> + * Start the search from urp. Caller need to call UndoRecordRelease
> to release the
> + * resources allocated by this function.
> + *
> + * urec_ptr_out is undo record pointer of the qualified undo record if valid
> + * pointer is passed.
> + */
> +UnpackedUndoRecord *
> +UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
> + TransactionId xid, UndoRecPtr * urec_ptr_out,
> + SatisfyUndoRecordCallback callback)
>
>
> The comment above UndoFetchRecord is very zheap specific, so I have
> tried to simplify it. I think we can give so much detailed examples
> only when we introduce zheap code.

Make sense.
>
> Apart from above, there are miscellaneous comments and minor code
> edits in the attached delta patch.

I have merged your changes.
>
> 12.
> PrepareUndoInsert()
> {
> ..
> + /*
> + * If this is the first undo record for this top transaction add the
> + * transaction information to the undo record.
> + *
> + * XXX there is also an option that instead of adding the information to
> + * this record we can prepare a new record which only contain transaction
> + * informations.
> + */
> + if (xid == InvalidTransactionId)
>
> The above comment seems to be out of place, we are doing nothing like
> that here. This work is done in UndoRecordAllocate, may be you can
> move 'XXX ..' part of the comment in that function.

Done
>
> 13.
> PrepareUndoInsert()
> {
> ..
> if (!UndoRecPtrIsValid(prepared_urec_ptr))
> + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> + else
> + urecptr = prepared_urec_ptr;
> +
> + size = UndoRecordExpectedSize(urec);
> ..
>
> I think we should make above code a bit more bulletproof. As it is
> written, there is no guarantee the size we have allocated is same as
> we are using in this function.
I agree
How about if we take 'size' as output
> parameter from UndoRecordAllocate and then use it in this function?
> Additionally, we can have an Assert that the size returned by
> UndoRecordAllocate is same as UndoRecordExpectedSize.
With this change we will be able to guarantee when we are allocating
single undo record
but multi prepare will still be a problem. I haven't fix this as of
now. I will think on how
to handle both the cases when we have to prepare one time or when we
have to allocate
once and prepare multiple time.

>
> 14.
> +
> +void
> +RegisterUndoLogBuffers(uint8 first_block_id)
> +{
> + int idx;
> + int flags;
> +
> + for (idx = 0; idx < buffer_idx; idx++)
> + {
> + flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
> + XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
> + }
> +}
> +
> +void
> +UndoLogBuffersSetLSN(XLogRecPtr recptr)
> +{
> + int idx;
> +
> + for (idx = 0; idx < buffer_idx; idx++)
> + PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
> +}
>
> One line comment atop of these functions will be good. It will be
> better if we place these functions at the end of file or someplace
> else, as right now they are between prepare* and insert* function
> calls which makes code flow bit awkward.
Done
>
> 15.
> + * and mark them dirty. For persistent undo, this step should be performed
> + * after entering a critical section; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
> +{
>
> Why only for persistent undo this step should be performed in the
> critical section? I think as this function operates on shred buffer,
> even for unlogged undo, it should be done in a critical section.

I think we can remove this comment? I have removed that in the current patch.
>
> 16.
> +InsertPreparedUndo(void)
> {
> ..
> /* if starting a new log then there is no prevlen to store */
> + if (offset == UndoLogBlockHeaderSize)
> + {
> + if (log->meta.prevlogno != InvalidUndoLogNumber)
> + {
> + UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
> +
> + uur->uur_prevlen = prevlog->meta.prevlen;
> + }
> ..
> }
>
> The comment here suggests that for new logs, we don't need prevlen,
> but still, in one case you are maintaining the length, can you add few
> comments to explain why?

Done
>
> 17.
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> + UndoPersistence persistence)
> {
> ..
> + /*
> + * FIXME: This can be optimized to just fetch header first and only if
> + * matches with block number and offset then fetch the complete
> + * record.
> + */
> + if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
> + break;
> ..
> }
>
> I don't know how much it matters if we fetch complete record or just
> it's header unless the record is big or it falls in two pages. I
> think both are boundary cases and I couldn't see this part much in
> perf profiles. There is nothing to fix here if you want you can a XXX
> comment or maybe suggest it as a future optimization.

Changed
>
> 18.
> +UndoFetchRecord()
> {
> ...
> + /*
> + * Prevent UndoDiscardOneLog() from discarding data while we try to
> + * read it. Usually we would acquire log->mutex to read log->meta
> + * members, but in this case we know that discard can't move without
> + * also holding log->discard_lock.
> + */
> + LWLockAcquire(&log->discard_lock, LW_SHARED);
> + if (!UndoRecPtrIsValid(log->oldest_data))
> + {
> + /*
> + * UndoDiscardInfo is not yet initialized. Hence, we've to check
> + * UndoLogIsDiscarded and if it's already discarded then we have
> + * nothing to do.
> + */
> + LWLockRelease(&log->discard_lock);
> + if (UndoLogIsDiscarded(urp))
> + {
> + if (BufferIsValid(urec->uur_buffer))
> + ReleaseBuffer(urec->uur_buffer);
> + return NULL;
> + }
> +
> + LWLockAcquire(&log->discard_lock, LW_SHARED);
> + }
> +
> + /* Check if it's already discarded. */
> + if (urp < log->oldest_data)
> + {
> + LWLockRelease(&log->discard_lock);
> + if (BufferIsValid(urec->uur_buffer))
> + ReleaseBuffer(urec->uur_buffer);
> + return NULL;
> + }
> ..
> }
>
> Can't we replace this logic with UndoRecordIsValid?

Done
>
> 19.
> UndoFetchRecord()
> {
> ..
> while(true)
> {
> ..
> /*
> + * If we have a valid buffer pinned then just ensure that we want to
> + * find the next tuple from the same block. Otherwise release the
> + * buffer and set it invalid
> + */
> + if (BufferIsValid(urec->uur_buffer))
> + {
> + /*
> + * Undo buffer will be changed if the next undo record belongs to
> + * a different block or undo log.
> + */
> + if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
> + (prevrnode.relNode != rnode.relNode))
> + {
> + ReleaseBuffer(urec->uur_buffer);
> + urec->uur_buffer = InvalidBuffer;
> + }
> + }
> + else
> + {
> + /*
> + * If there is not a valid buffer in urec->uur_buffer that means
> + * we had copied the payload data and tuple data so free them.
> + */
> + if (urec->uur_payload.data)
> + pfree(urec->uur_payload.data);
> + if (urec->uur_tuple.data)
> + pfree(urec->uur_tuple.data);
> + }
> +
> + /* Reset the urec before fetching the tuple */
> + urec->uur_tuple.data = NULL;
> + urec->uur_tuple.len = 0;
> + urec->uur_payload.data = NULL;
> + urec->uur_payload.len = 0;
> + prevrnode = rnode;
> ..
> }
>
> Can't we move this logic after getting a record with UndoGetOneRecord
> and matching with a callback? This is certainly required after the
> first record, so it looks a bit odd here. Also, if possible can we
> move it to a separate function as this is not the main logic and makes
> the main logic difficult to follow.
>
Fixed

Apart from fixing these comments I have also rebased Thomas' undo log
patches on the current head.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
0001-Add-undo-log-manager_v3.patch application/x-patch 115.4 KB
0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v3.patch application/x-patch 40.3 KB
0003-undo-interface-v10.patch application/x-patch 69.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2018-12-04 09:43:55 Re: Undo worker and transaction rollback
Previous Message Magnus Hagander 2018-12-04 08:25:09 Re: error message when subscription target is a partitioned table