Re: Undo logs

From: Dilip Kumar <dilip(dot)kumar(at)enterprisedb(dot)com>
To: amit(dot)kapila16(at)gmail(dot)com
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Undo logs
Date: 2018-12-12 05:47:40
Message-ID: CAC6YOK61OFuKjVzTKa6msVhSsrs8O+=Pb=0CTZ9oJB_m0obvEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > >
> > > 13.
> > > PrepareUndoInsert()
> > > {
> > > ..
> > > if (!UndoRecPtrIsValid(prepared_urec_ptr))
> > > + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> > > + else
> > > + urecptr = prepared_urec_ptr;
> > > +
> > > + size = UndoRecordExpectedSize(urec);
> > > ..
> > >
> > > I think we should make above code a bit more bulletproof. As it is
> > > written, there is no guarantee the size we have allocated is same as
> > > we are using in this function.
> > I agree
> > How about if we take 'size' as output
> > > parameter from UndoRecordAllocate and then use it in this function?
> > > Additionally, we can have an Assert that the size returned by
> > > UndoRecordAllocate is same as UndoRecordExpectedSize.
> > With this change we will be able to guarantee when we are allocating
> > single undo record
> > but multi prepare will still be a problem. I haven't fix this as of
> > now. I will think on how
> > to handle both the cases when we have to prepare one time or when we
> > have to allocate
> > once and prepare multiple time.
> >
>
> Yeah, this appears tricky. I think we can leave it as it is unless we
> get some better idea.
>
> 1.
> +void
> +UndoRecordOnUndoLogChange(UndoPersistence persistence)
> +{
> + prev_txid[persistence] = InvalidTransactionId;
> +}
>
> Are we using this API in this or any other patch or in zheap? I see
> this can be useful API, but unless we have a plan to use it, I don't
> think we should retain it.
>
Currently, we are not using it so removed.

2.
> + * Handling multi log:
> + *
> + * It is possible that the undo record of a transaction can be spread
> across
> + * multiple undo log. And, we need some special handling while
> inserting the
> + * undo for discard and rollback to work sanely.
> + *
> + * If the undorecord goes to next log then we insert a transaction
> header for
> + * the first record in the new log and update the transaction header
> with this
> + * new log's location. This will allow us to connect transactions across
> logs
> + * when the same transaction span across log (for this we keep track of
> the
> + * previous logno in undo log meta) which is required to find the latest
> undo
> + * record pointer of the aborted transaction for executing the undo
> actions
> + * before discard. If the next log get processed first in that case we
> + * don't need to trace back the actual start pointer of the transaction,
> + * in such case we can only execute the undo actions from the current log
> + * because the undo pointer in the slot will be rewound and that
> will be enough
> + * to avoid executing same actions. However, there is possibility that
> after
> + * executing the undo actions the undo pointer got discarded, now in
> later
> + * stage while processing the previous log it might try to fetch the undo
> + * record in the discarded log while chasing the transaction header
> chain.
> + * To avoid this situation we first check if the next_urec of the
> transaction
> + * is already discarded then no need to access that and start executing
> from
> + * the last undo record in the current log.
>
> I think I see the problem in the discard mechanism when the log is
> spread across multiple logs. Basically, if the second log contains
> undo of some transaction prior to the transaction which has just
> decided to spread it's undo in the chosen undo log, then we might
> discard the undo log of some transaction(s) inadvertently. Am, I
> missing something?

Actually, I don't see exactly this problem here because we only process one
undo log at a time, so we will not got to the next undo log and discard
some transaction for which we supposed to retain the undo.

> If not, then I guess we need to ensure that we
> don't immediately discard the undo in the second log when a single
> transactions undo is spreaded across two logs
>
> Before choosing a new undo log to span the undo for a transaction, do
> we ensure that it is not already linked with some other undo log for a
> similar reason?
>
> One more thing in this regard:
> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
> + TransactionId txid, UndoPersistence upersistence)
> {
> ..
> ..
> + if (InRecovery)
> + urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
> + else
> + urecptr = UndoLogAllocate(size, upersistence);
> +
> + log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
> +
> + /*
> + * By now, we must be attached to some undo log unless we are in recovery.
> + */
> + Assert(AmAttachedToUndoLog(log) || InRecovery);
> +
> + /*
> + * We can consider the log as switched if this is the first record of the
> + * log and not the first record of the transaction i.e. same transaction
> + * continued from the previous log.
> + */
> + if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
> + log->meta.prevlogno != InvalidUndoLogNumber)
> + log_switched = true;
> ..
> ..
> }
>
> Isn't there a hidden assumption in the above code that you will always
> get a fresh undo log if the undo doesn't fit in the currently attached
> log? What is the guarantee of same?
>

Yeah it's a problem, we might get the undo which may not be empty. One way
to avoid this could be that instead of relying on the check
"UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize". We can add some
flag to the undo log meta data that whether it's the first record after
attach or not and based on that we can decide. But, I want to think some
better solution where we can identify without adding anything extra to undo
meta.

3.
> +/*
> + * Call PrepareUndoInsert to tell the undo subsystem about the undo
> record you
> + * intended to insert. Upon return, the necessary undo buffers are
> pinned and
> + * locked.
> + * This should be done before any critical section is established, since
> it
> + * can fail.
> + *
> + * If not in recovery, 'xid' should refer to the top transaction id
> because
> + * undo log only stores mapping for the top most transactions.
> + * If in recovery, 'xid' refers to the transaction id stored in WAL.
> + */
> +extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId
> xid,
> + UndoPersistence);
> +
> +/*
> + * Insert a previously-prepared undo record. This will write the actual
> undo
> + * record into the buffers already pinned and locked in
> PreparedUndoInsert,
> + * and mark them dirty. For persistent undo, this step should be
> performed
> + * after entering a critical section; it should never fail.
> + */
> +extern void InsertPreparedUndo(void);
>
> This text is duplicate of what is mentioned in .c file, so I have
> removed it in delta patch. Similarly, I have removed duplicate text
> atop other functions exposed via undorecord.h
>
> 4.
> +/*
> + * Forget about any previously-prepared undo record. Error recovery calls
> + * this, but it can also be used by other code that changes its mind about
> + * inserting undo after having prepared a record for insertion.
> + */
> +extern void CancelPreparedUndo(void);
>
> This API is nowhere defined or used. What is the intention?
>

Not required

> 5.
> +typedef struct UndoRecordHeader
> +{
> ..
> + /*
> + * Transaction id that has modified the tuple present in this undo record.
> + * If this is older then RecentGlobalXmin, then we can consider the tuple
> + * in this undo record as visible.
> + */
> + TransactionId urec_prevxid;
> ..
>
> /then/than
>
Done

>
> I think we need to mention oldestXidWithEpochHavingUndo instead of
> RecentGlobalXmin.
>
Merged

>
> 6.
> + * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
> + *
> + * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
> + *
> + * When (as will often be the case) multiple structures are present, they
> + * appear in the same order in which the constants are defined here.
> That is,
> + * UndoRecordRelationDetails appears first.
> + */
> +#define UREC_INFO_RELATION_DETAILS 0x01
>
> I have expanded this comments in the attached delta patch.

Merged

> I think we
> should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
> patch as this is zheap specific and should be added later along with
> the zheap code.
>

Yeah we can, so removed in my new patch.

7.
> +/*
> + * Additional information about a relation to which this record pertains,
> + * namely the tablespace OID and fork number. If the tablespace OID is
> + * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this
> structure
> + * can (and should) be omitted.
> + */
> +typedef struct UndoRecordRelationDetails
> +{
> + ForkNumber urec_fork; /* fork number */
> +} UndoRecordRelationDetails;
>
> This comment seems to be out-dated, so modified in the attached delta
> patch.
>
Merged

>
> 8.
> +typedef struct UndoRecordTransaction
> +{
> + uint32 urec_progress; /* undo applying progress. */
> + uint32 urec_xidepoch; /* epoch of the current transaction */
>
> Can you expand comments about how the progress is defined and used?
>
Moved your comment from UnpackedUndoRecord to this structure and in
UnpackedUndoRecord I have mentioned that we can refer detailed comment in
this structure.

> Also, write a few sentences about why epoch is captured and or used?

urec_xidepoch is captured mainly for the zheap visibility purpose so isn't
it good that we mention it there?

> 9.
> +#define urec_next_pos \
> + (SizeOfUndoRecordTransaction - SizeOfUrecNext)
>
> What is its purpose?
>

It's not required so removed

>
> 10.
> +/*
> + * Set uur_info for an UnpackedUndoRecord appropriately based on which
> + * other fields are set.
> + */
> +extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
> +
> +/*
> + * Compute the number of bytes of storage that will be required to insert
> + * an undo record. Sets uur->uur_info as a side effect.
> + */
> +extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
>
> Again, I see duplicate text in .h and .c files, so removed this and
> similar comments from .h files. I have tried to move some part of
> comments from .h to .c file, so that it is easier to read from one
> place rather than referring at two places. See, if I have missed
> anything.
>
> Apart from the above, I have done few other cosmetic changes in the
> attached delta patch, see, if you like those, kindly include it in
> the main patch.

Done

Attachment Content-Type Size
0003-undo-interface-v11.patch application/octet-stream 66.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-12-12 05:55:17 Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name
Previous Message Andres Freund 2018-12-12 05:34:58 Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name