Re: POC: Cleaning up orphaned files using undo logs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-07 09:20:17
Message-ID: CAA4eK1KnnWHKiPT_JYV=XGXPHZGdZCtvH5WqcoRoyBvKsEqDjg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
>

I am responding to some of the points where I need more input or where
some discussion is required. Some of the points need more thought and
I will respond to them later; others are quite straightforward and
don't need much discussion.

>
> > +/*
> > + * Binary heap comparison function to compare the time at which an error
> > + * occurred for transactions.
> > + *
> > + * The error queue is sorted by next_retry_at and err_occurred_at. Currently,
> > + * the next_retry_at has some constant delay time (see PushErrorQueueElem), so
> > + * it doesn't make much sense to sort by both values. However, in future, if
> > + * we have some different algorithm for next_retry_at, then it will work
> > + * seamlessly.
> > + */
>
> Why is it useful to have error_occurred_at be part of the comparison at
> all? If we need a tiebraker, err_occurred_at isn't that (if we can get
> conflicts for next_retry_at, then we can also get conflicts in
> err_occurred_at). Seems better to use something actually guaranteed to
> be unique for a tiebreaker.
>

This was to distinguish the case where a request is failing multiple
times from the case where it failed only the first time. I agree that
we need a guaranteed-unique tiebreaker, such as the FullTransactionId.
Do let me know if you have any other suggestion.
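Something like the following sketch is what I have in mind for the comparator (the type names here are illustrative stand-ins, not the patch's actual definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in types for illustration; the patch's real definitions differ. */
typedef uint64_t FullTransactionId;
typedef int64_t TimestampTz;

typedef struct RollbackErrorQueueElem
{
	TimestampTz       next_retry_at;	/* when to retry this request */
	FullTransactionId full_xid;			/* guaranteed-unique tiebreaker */
} RollbackErrorQueueElem;

/*
 * Returns 1 if 'a' should be dequeued before 'b', -1 if after, and 0 only
 * for the same element.  Because full_xid is unique per transaction, two
 * distinct elements can never compare as equal, unlike err_occurred_at.
 */
static int
rollback_error_elem_cmp(const RollbackErrorQueueElem *a,
						const RollbackErrorQueueElem *b)
{
	if (a->next_retry_at != b->next_retry_at)
		return (a->next_retry_at < b->next_retry_at) ? 1 : -1;
	if (a->full_xid != b->full_xid)
		return (a->full_xid < b->full_xid) ? 1 : -1;
	return 0;
}
```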

>
>
> > + /*
> > + * The rollback hash table is used to avoid duplicate undo requests by
> > + * backends and discard worker. The table must be able to accomodate all
> > + * active undo requests. The undo requests must appear in both xid and
> > + * size requests queues or neither. In same transaction, there can be two
> > + * requests one for logged relations and another for unlogged relations.
> > + * So, the rollback hash table size should be equal to two request queues,
> > + * an error queue (currently this is same as request queue) and max
>
> "the same"? I assume this intended to mean the same size?
>

Yes. I will add the word "size" to make this clearer.

>
> > + * backends. This will ensure that it won't get filled.
> > + */
>
> How does this ensure anything?
>

Because, based on this, we will have a hard limit on the number of undo
requests, after which we won't allow more requests. See the more
detailed explanation of this later in this email. I agree the comment
needs to be updated.

>
> > + * the binary heaps which can change.
> > + */
> > + Assert(LWLockHeldByMeInMode(RollbackRequestLock, LW_EXCLUSIVE));
> > +
> > + /*
> > + * We normally push the rollback request to undo workers if the size of
> > + * same is above a certain threshold.
> > + */
> > + if (req_size >= rollback_overflow_size * 1024 * 1024)
> > + {
>
> Why is this being checked with the lock held? Seems like this should be
> handled in a pre-check?
>

Yeah, it can be a pre-check, but I thought it was better to encapsulate
everything in the function, as this is not an expensive check. We can
move it outside the lock to avoid any such confusion.
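To sketch what moving it outside the lock would look like (the GUC name rollback_overflow_size comes from the quoted patch; the helper itself is purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* rollback_overflow_size (in MB) is the GUC named in the quoted patch. */
static int rollback_overflow_size = 64;

/*
 * Cheap, lock-free pre-check: is the request large enough that it should
 * be pushed to a background undo worker?  Nothing here touches shared
 * state, so it can run before RollbackRequestLock is taken; only the
 * queue manipulation itself then needs the exclusive lock.
 */
static int
undo_request_is_large(size_t req_size)
{
	return req_size >= (size_t) rollback_overflow_size * 1024 * 1024;
}
```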

>
> > + * allow_peek - if true, peeks a few element from each queue to check whether
> > + * any request matches current dbid.
> > + * remove_from_queue - if true, picks an element from the queue whose dbid
> > + * matches current dbid and remove it from the queue before returning the same
> > + * to caller.
> > + * urinfo - this is an OUT parameter that returns the details of undo request
> > + * whose undo action is still pending.
> > + * in_other_db_out - this is an OUT parameter. If we've not found any work
> > + * for current database, but there is work for some other database, we set
> > + * this parameter as true.
> > + */
> > +bool
> > +UndoGetWork(bool allow_peek, bool remove_from_queue, UndoRequestInfo *urinfo,
> > + bool *in_other_db_out)
> > +{
>
>
> > + /*
> > + * If some undo worker is already processing the rollback request or
> > + * it is already processed, then we drop that request from the queue
> > + * and fetch the next entry from the queue.
> > + */
> > + if (!rh || UndoRequestIsInProgress(rh))
> > + {
> > + RemoveRequestFromQueue(cur_queue, 0);
> > + cur_undo_queue++;
> > + continue;
> > + }
>
> When is it possible to hit the in-progress case?
>

The same request is in two queues. It is possible that, while the
request is being processed from the xid queue by one worker, the
request from the other queue is picked up by another worker. I think
this case won't exist after we switch to rbtree-based queues.

> > +/*
> > + * UpdateUndoApplyProgress - Updates how far undo actions from a particular
> > + * log have been applied while rolling back a transaction. This progress is
> > + * measured in terms of undo block number of the undo log till which the
> > + * undo actions have been applied.
> > + */
> > +static void
> > +UpdateUndoApplyProgress(UndoRecPtr progress_urec_ptr,
> > + BlockNumber block_num)
> > +{
> > + UndoLogCategory category;
> > + UndoRecordInsertContext context = {{0}};
> > +
> > + category =
> > + UndoLogNumberGetCategory(UndoRecPtrGetLogNo(progress_urec_ptr));
> > +
> > + /*
> > + * We don't need to update the progress for temp tables as they get
> > + * discraded after startup.
> > + */
> > + if (category == UNDO_TEMP)
> > + return;
> > +
> > + BeginUndoRecordInsert(&context, category, 1, NULL);
> > +
> > + /*
> > + * Prepare and update the undo apply progress in the transaction header.
> > + */
> > + UndoRecordPrepareApplyProgress(&context, progress_urec_ptr, block_num);
> > +
> > + START_CRIT_SECTION();
> > +
> > + /* Update the progress in the transaction header. */
> > + UndoRecordUpdateTransInfo(&context, 0);
> > +
> > + /* WAL log the undo apply progress. */
> > + {
> > + XLogRecPtr lsn;
> > + xl_undoapply_progress xlrec;
> > +
> > + xlrec.urec_ptr = progress_urec_ptr;
> > + xlrec.progress = block_num;
> > +
> > + XLogBeginInsert();
> > + XLogRegisterData((char *) &xlrec, sizeof(xlrec));
> > +
> > + RegisterUndoLogBuffers(&context, 1);
> > + lsn = XLogInsert(RM_UNDOACTION_ID, XLOG_UNDO_APPLY_PROGRESS);
> > + UndoLogBuffersSetLSN(&context, lsn);
> > + }
> > +
> > + END_CRIT_SECTION();
> > +
> > + /* Release undo buffers. */
> > + FinishUndoRecordInsert(&context);
> > +}
>
> This whole prepare/execute split for updating apply pregress, and next
> undo pointers makes no sense to me.
>

Can you explain your concern here? Basically, in the prepare phase we
read and lock the buffer, and in the actual update phase (which is
inside the critical section) we update the contents of the shared
buffer. This is the same idea we use in many other places in the
code.

>
> > typedef struct TwoPhaseFileHeader
> > {
> > @@ -927,6 +928,16 @@ typedef struct TwoPhaseFileHeader
> > uint16 gidlen; /* length of the GID - GID follows the header */
> > XLogRecPtr origin_lsn; /* lsn of this record at origin node */
> > TimestampTz origin_timestamp; /* time of prepare at origin node */
> > +
> > + /*
> > + * We need the locations of the start and end undo record pointers when
> > + * rollbacks are to be performed for prepared transactions using undo-based
> > + * relations. We need to store this information in the file as the user
> > + * might rollback the prepared transaction after recovery and for that we
> > + * need its start and end undo locations.
> > + */
> > + UndoRecPtr start_urec_ptr[UndoLogCategories];
> > + UndoRecPtr end_urec_ptr[UndoLogCategories];
> > } TwoPhaseFileHeader;
>
> Why do we not need that knowledge for undo processing of a non-prepared
> transaction?
>

A non-prepared transaction also needs that knowledge; it is stored in
TransactionStateData. I am not sure I understand your question here.

>
> > + * applying undo via top-level transaction, if we get an error,
> > + * then it is handled by ReleaseResourcesAndProcessUndo
>
> Where and how does it handle that? Maybe I misunderstand what you mean?
>

It is handled in ProcessUndoRequestForEachLogCat, which is called from
ReleaseResourcesAndProcessUndo. Basically, the error is handled in the
catch block and we insert the request into the error queue. The
function name in the comment should be corrected.

>
> > + case TBLOCK_UNDO:
> > + /*
> > + * We reach here when we got error while applying undo
> > + * actions, so we don't want to again start applying it. Undo
> > + * workers can take care of it.
> > + *
> > + * AbortTransaction is already done, still need to release
> > + * locks and perform cleanup.
> > + */
> > + ResetUndoActionsInfo();
> > + ResourceOwnerRelease(s->curTransactionOwner,
> > + RESOURCE_RELEASE_LOCKS,
> > + false,
> > + true);
> > + s->state = TRANS_ABORT;
> > CleanupTransaction();
>
> Hm. Why is it ok that we only perform that cleanup action? Either the
> rest of potentially held resources will get cleaned up somehow as well,
> in which case this ResourceOwnerRelease() ought to be redundant, or
> we're potentially leaking important resources like buffer pins, relcache
> references and whatnot here?
>

I had initially used AbortTransaction() here for such things, but I
was not sure whether that is the right thing to do when we reach here
in this state, because AbortTransaction has already been done by the
time we get here. The same applies to the TBLOCK_SUBUNDO state a few
lines below, where I had used AbortSubTransaction. One problem I
faced when AbortSubTransaction was invoked in this code path is that it
internally invokes RecordTransactionAbort->XidCacheRemoveRunningXids,
which results in the error "did not find subXID %u in MyProc". The
reason is obvious: we had already removed it when AbortSubTransaction
was invoked before applying the undo actions. The release of locks is
the one thing we deliberately delayed to allow the undo actions to be
applied, and that is done here. The other idea I had was to call
AbortTransaction/AbortSubTransaction but somehow avoid calling
RecordTransactionAbort when in this state. Do you have any
suggestion for dealing with this?

>
> > +{
> > + TransactionState s = CurrentTransactionState;
> > + bool result;
> > + int i;
> > +
> > + /*
> > + * We don't want to apply the undo actions when we are already cleaning up
> > + * for FATAL error. See ReleaseResourcesAndProcessUndo.
> > + */
> > + if (SemiCritSectionCount > 0)
> > + {
> > + ResetUndoActionsInfo();
> > + return;
> > + }
>
> Wait what? Semi critical sections?
>

Robert suggested this idea upthread [1] (see the paragraph starting with
"I am not a fan of applying_subxact_undo....") to deal with cases
where we get an error while applying undo actions and need to promote
the error to FATAL. We have two such cases in this patch as of now:
one is when we process the temp log category and the other is when we
are rolling back subtransactions. The detailed reasons are given in
execute_undo_actions. I think this mechanism can be used for other
things in the future as well.
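To illustrate the mechanism in isolation (a sketch only; the names approximate those in the patch and the actual promotion plumbing is omitted), a semi-critical section is just a counter that the error path consults:

```c
#include <assert.h>

/*
 * Sketch of the semi-critical-section counter.  While the counter is
 * non-zero, an ERROR raised during undo apply must be promoted to FATAL
 * instead of being caught and retried, e.g. for temp-category undo or
 * subtransaction rollback.
 */
static int SemiCritSectionCount = 0;

#define START_SEMI_CRIT_SECTION()	(SemiCritSectionCount++)
#define END_SEMI_CRIT_SECTION()		(SemiCritSectionCount--)

/* The error path consults this to decide whether to escalate to FATAL. */
static int
error_should_be_fatal(void)
{
	return SemiCritSectionCount > 0;
}
```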

>
>
> > + for (i = 0; i < UndoLogCategories; i++)
> > + {
> > + /*
> > + * We can't push the undo actions for temp table to background
> > + * workers as the the temp tables are only accessible in the
> > + * backend that has created them.
> > + */
> > + if (i != UNDO_TEMP && UndoRecPtrIsValid(s->latestUrecPtr[i]))
> > + {
> > + result = RegisterUndoRequest(s->latestUrecPtr[i],
> > + s->startUrecPtr[i],
> > + MyDatabaseId,
> > + GetTopFullTransactionId());
> > + s->undoRequestResgistered[i] = result;
> > + }
> > + }
>
> Give code like this I have a hard time seing what the point of having
> separate queue entries for the different persistency levels is.
>

It is not for this case; rather, it is for the discard worker (a
background worker), which processes transactions at the log level.
The permanent and unlogged parts of a transaction will be in separate
logs and can be encountered at different times, which leads to
separate entries for them. I am planning to try to unify them based
on some of the discussions in this email thread.

>
> > +void
> > +ReleaseResourcesAndProcessUndo(void)
> > +{
> > + TransactionState s = CurrentTransactionState;
> > +
> > + /*
> > + * We don't want to apply the undo actions when we are already cleaning up
> > + * for FATAL error. One of the main reasons is that we might be already
> > + * processing undo actions for a (sub)transaction when we reach here
> > + * (for ex. error happens while processing undo actions for a
> > + * subtransaction).
> > + */
> > + if (SemiCritSectionCount > 0)
> > + {
> > + ResetUndoActionsInfo();
> > + return;
> > + }
> > +
> > + if (!NeedToPerformUndoActions())
> > + return;
> > +
> > + /*
> > + * State should still be TRANS_ABORT from AbortTransaction().
> > + */
> > + if (s->state != TRANS_ABORT)
> > + elog(FATAL, "ReleaseResourcesAndProcessUndo: unexpected state %s",
> > + TransStateAsString(s->state));
> > +
> > + /*
> > + * Do abort cleanup processing before applying the undo actions. We must
> > + * do this before applying the undo actions to remove the effects of
> > + * failed transaction.
> > + */
> > + if (IsSubTransaction())
> > + {
> > + AtSubCleanup_Portals(s->subTransactionId);
> > + s->blockState = TBLOCK_SUBUNDO;
> > + }
> > + else
> > + {
> > + AtCleanup_Portals(); /* now safe to release portal memory */
> > + AtEOXact_Snapshot(false, true); /* and release the transaction's
> > + * snapshots */
>
> Why do precisely these actions need to be performed here?
>

This is to get the transaction into a clean state. AbortTransaction
has been performed before calling this function, and these are the few
remaining things we need to do for cleanup.

>
> > + s->fullTransactionId = InvalidFullTransactionId;
> > + s->subTransactionId = TopSubTransactionId;
> > + s->blockState = TBLOCK_UNDO;
> > + }
> > +
> > + s->state = TRANS_UNDO;
>
> This seems guaranteed to constantly be out of date with other
> modifications of the commit/abort sequence.
>

It is similar to how we change state in Abort(Sub)Transaction, and we
change the state back to TRANS_ABORT after applying undo in this
function. So I am not sure how it can become out of date. Do you have
a better suggestion here?

>
>
> > +bool
> > +ProcessUndoRequestForEachLogCat(FullTransactionId fxid, Oid dbid,
> > + UndoRecPtr *end_urec_ptr, UndoRecPtr *start_urec_ptr,
> > + bool *undoRequestResgistered, bool isSubTrans)
> > +{
> > + UndoRequestInfo urinfo;
> > + int i;
> > + uint32 save_holdoff;
> > + bool success = true;
> > +
> > + for (i = 0; i < UndoLogCategories; i++)
> > + {
> > + if (end_urec_ptr[i] && !undoRequestResgistered[i])
> > + {
> > + save_holdoff = InterruptHoldoffCount;
> > +
> > + PG_TRY();
> > + {
> > + /* for subtransactions, we do partial rollback. */
> > + execute_undo_actions(fxid,
> > + end_urec_ptr[i],
> > + start_urec_ptr[i],
> > + !isSubTrans);
> > + }
> > + PG_CATCH();
> > + {
> > + /*
> > + * Add the request into an error queue so that it can be
> > + * processed in a timely fashion.
> > + *
> > + * If we fail to add the request in an error queue, then mark
> > + * the entry status as invalid and continue to process the
> > + * remaining undo requests if any. This request will be later
> > + * added back to the queue by discard worker.
> > + */
> > + ResetUndoRequestInfo(&urinfo);
> > + urinfo.dbid = dbid;
> > + urinfo.full_xid = fxid;
> > + urinfo.start_urec_ptr = start_urec_ptr[i];
> > + if (!InsertRequestIntoErrorUndoQueue(&urinfo))
> > + RollbackHTMarkEntryInvalid(urinfo.full_xid,
> > + urinfo.start_urec_ptr);
> > + /*
> > + * Errors can reset holdoff count, so restore back. This is
> > + * required because this function can be called after holding
> > + * interrupts.
> > + */
> > + InterruptHoldoffCount = save_holdoff;
> > +
> > + /* Send the error only to server log. */
> > + err_out_to_client(false);
> > + EmitErrorReport();
> > +
> > + success = false;
> > +
> > + /* We should never reach here when we are in a semi-critical-section. */
> > + Assert(SemiCritSectionCount == 0);
>
> This seems entirely and completely broken. You can't just catch an
> exception and continue. What if somebody held an lwlock when the error
> was thrown? A buffer pin?
>

The caller deals with that. For example, when this is called from
FinishPreparedTransaction, we do AbortOutOfAnyTransaction, and when it
is called from ReleaseResourcesAndProcessUndo, we just release locks.
I think we might need to do something additional for
ReleaseResourcesAndProcessUndo. Earlier I had AbortTransaction here
as well, but I was not sure whether that is the right thing to do,
especially because it would lead to RecordTransactionAbort being
called twice: once when we do AbortTransaction before applying the
undo actions and once when we do it after catching the exception. As
I said earlier, maybe the right way is to just avoid calling
RecordTransactionAbort again.

>
> > +to complete the requests by themselves. There is an exception to it where when
> > +error queue becomes full, we just mark the request as 'invalid' and continue to
> > +process other requests if any. The discard worker will find this errored
> > +transaction at later point of time and again add it to the request queues.
>
> You say it's an exception, but you do not explain why that exception is
> there.
>

The exception is for when the error queue becomes full. The idea is
that the individual queues can become full, but the hash table cannot.

> Nor why that's not a problem for:
>
> > +We have the hard limit (proportional to the size of the rollback hash table)
> > +for the number of transactions that can have pending undo. This can help us
> > +in computing the value of oldestXidHavingUnappliedUndo and allowing us not to
> > +accumulate pending undo for a long time which will eventually block the
> > +discard of undo.
>

The reason it is not a problem is that we don't remove the entry from
the hash table; we just mark it so that the discard worker can later
add it back to the queues. I am not sure I understood your question
completely, but let me try to explain the idea in a bit more detail.

The basic idea is that the rollback hash table has space equivalent to
all three queues plus 2 * MaxBackends. We will stop allowing new
transactions that want to write undo once the hash table has entries
equivalent to all three queues and we have 2 * MaxBackends sessions
already attached to undo logs for uncommitted transactions. Assume
each queue has size 5 and MaxBackends = 10; then ideally we can have
35 entries (3 * 5 + 2 * 10) in the hash table. The way all this
relates to the error queue being full is as follows:

Say we have 15 hash table entries, which means all queues are full,
and 10 backends are connected, each writing to two different logs
(permanent and unlogged). Next, one of the transactions errors out
and tries to roll back. At this stage it adds an entry to the hash
table and tries to execute the undo actions. While executing those
actions it gets an error and can't add the request to the error queue
because that queue is full, so it just marks the hash table entry as
invalid and proceeds (consider that this happens for both the logged
and unlogged categories). At this stage we have 17 entries in the
hash table and the other 9 backends attached to 18 logs, which
accounts for the full 35 transactions if the system crashes at this
point. The backend that errored out then tries to perform another
operation that needs to write undo. We won't allow this backend to
perform that action, because if it crashed after performing the
operation and before committing, the hash table would overflow.
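The sizing arithmetic above can be written out explicitly (a worked sketch of the rule, not the patch's actual code):

```c
#include <assert.h>

/*
 * Worked example of the sizing rule described above:
 *   capacity = nqueues * queue_size + 2 * max_backends
 * With three queues of size 5 and MaxBackends = 10 this gives 35,
 * matching the example in the text.
 */
static int
rollback_hash_table_capacity(int nqueues, int queue_size, int max_backends)
{
	return nqueues * queue_size + 2 * max_backends;
}
```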

Currently, there are some problems with the hash table overflow checks
in the code that need to be fixed.

>
> > + /* There might not be any undo log and hibernation might be needed. */
> > + *hibernate = true;
> > +
> > + StartTransactionCommand();
>
> Why do we need this? I assume it's so we can have a resource owner?
>

Yes, and another reason is that we are using dbid_exists in this function.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
