Quick Links

Re: POC: Cleaning up orphaned files using undo logs

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: POC: Cleaning up orphaned files using undo logs
Date:	2019-08-05 18:29:34
Message-ID:	20190805182934.cgmxmxnjhm3ohhhf@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

(as I was out of context due to dealing with bugs, I've switched to
lookign at the current zheap/undoprocessing branch.

On 2019-07-30 01:02:20 -0700, Andres Freund wrote:
> +/*
> + * Insert a previously-prepared undo records.
> + *
> + * This function will write the actual undo record into the buffers which are
> + * already pinned and locked in PreparedUndoInsert, and mark them dirty. This
> + * step should be performed inside a critical section.
> + */

Again, I think it's not ok to just assume you can lock an essentially
unbounded number of buffers. This seems almost guaranteed to result in
deadlocks. And there's limits on how many lwlocks one can hold etc.

As far as I can tell there's simply no deadlock avoidance scheme in use
here *at all*? I must be missing something.

> + /* Main loop for writing the undo record. */
> + do
> + {

I'd prefer this to not be a do{} while(true) loop - as written I need to
read to the end to see what the condition is. I don't think we have any
loops like that in the code.

> + /*
> + * During recovery, there might be some blocks which are already
> + * deleted due to some discard command so we can just skip
> + * inserting into those blocks.
> + */
> + if (!BufferIsValid(buffer))
> + {
> + Assert(InRecovery);
> +
> + /*
> + * Skip actual writing just update the context so that we have
> + * write offset for inserting into next blocks.
> + */
> + SkipInsertingUndoData(&ucontext, BLCKSZ - starting_byte);
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + break;
> + }

How exactly can this happen?

> + else
> + {
> + page = BufferGetPage(buffer);
> +
> + /*
> + * Initialize the page whenever we try to write the first
> + * record in page. We start writing immediately after the
> + * block header.
> + */
> + if (starting_byte == UndoLogBlockHeaderSize)
> + UndoPageInit(page, BLCKSZ, prepared_undo->urec->uur_info,
> + ucontext.already_processed,
> + prepared_undo->urec->uur_tuple.len,
> + prepared_undo->urec->uur_payload.len);
> +
> + /*
> + * Try to insert the record into the current page. If it
> + * doesn't succeed then recall the routine with the next page.
> + */
> + InsertUndoData(&ucontext, page, starting_byte);
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + {
> + MarkBufferDirty(buffer);
> + break;

At this point we're five indentation levels deep. I'd extract at least
either the the per prepared undo code or the code performing the writing
across block boundaries into a separate function. Perhaps both.

> +/*
> + * Helper function for UndoGetOneRecord
> + *
> + * If any of rmid/reloid/xid/cid is not available in the undo record, then
> + * it will get the information from the first complete undo record in the
> + * page.
> + */
> +static void
> +GetCommonUndoRecInfo(UndoPackContext *ucontext, UndoRecPtr urp,
> + RelFileNode rnode, UndoLogCategory category, Buffer buffer)
> +{
> + /*
> + * If any of the common header field is not available in the current undo
> + * record then we must read it from the first complete record of the page.
> + */

How is it guaranteed that the first record on the page is actually from
the current transaction? Can't there be a situation where that's from
another transaction?

> +/*
> + * Helper function for UndoFetchRecord and UndoBulkFetchRecord
> + *
> + * curbuf - If an input buffer is valid then this function will not release the
> + * pin on that buffer. If the buffer is not valid then it will assign curbuf
> + * with the first buffer of the current undo record and also it will keep the
> + * pin and lock on that buffer in a hope that while traversing the undo chain
> + * the caller might want to read the previous undo record from the same block.
> + */

Wait, so at exit *curbuf is pinned but not locked, if passed in, but is
pinned *and* locked when not? That'd not be a sane API. I don't think
the code works like that atm though.

> +static UnpackedUndoRecord *
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> + UndoLogCategory category, Buffer *curbuf)
> +{
> + Page page;
> + int starting_byte = UndoRecPtrGetPageOffset(urp);
> + BlockNumber cur_blk;
> + UndoPackContext ucontext = {{0}};
> + Buffer buffer = *curbuf;
> +
> + cur_blk = UndoRecPtrGetBlockNum(urp);
> +
> + /* Initiate unpacking one undo record. */
> + BeginUnpackUndo(&ucontext);
> +
> + while (true)
> + {
> + /* If we already have a buffer then no need to allocate a new one. */
> + if (!BufferIsValid(buffer))
> + {
> + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
> + RBM_NORMAL, NULL,
> + RelPersistenceForUndoLogCategory(category));
> +
> + /*
> + * Remember the first buffer where this undo started as next undo
> + * record what we fetch might fall on the same buffer.
> + */
> + if (!BufferIsValid(*curbuf))
> + *curbuf = buffer;
> + }
> +
> + /* Acquire shared lock on the buffer before reading undo from it. */
> + LockBuffer(buffer, BUFFER_LOCK_SHARE);
> +
> + page = BufferGetPage(buffer);
> +
> + UnpackUndoData(&ucontext, page, starting_byte);
> +
> + /*
> + * We are done if we have reached to the done stage otherwise move to
> + * next block and continue reading from there.
> + */
> + if (ucontext.stage == UNDO_PACK_STAGE_DONE)
> + {
> + if (buffer != *curbuf)
> + UnlockReleaseBuffer(buffer);
> +
> + /*
> + * Get any of the missing fields from the first record of the
> + * page.
> + */
> + GetCommonUndoRecInfo(&ucontext, urp, rnode, category, *curbuf);
> + break;
> + }
> +
> + /*
> + * The record spans more than a page so we would have copied it (see
> + * UnpackUndoRecord). In such cases, we can release the buffer.
> + */

Where would it have been copied? Presumably in UnpackUndoData()? Imo the
comment should say so.

I'm a bit confused by the use of "would" in that comment. Either we
have, or not?

> + if (buffer != *curbuf)
> + UnlockReleaseBuffer(buffer);

Wait, so we *keep* the buffer locked if it the same as *curbuf? That
can't be right.

> + * Fetch the undo record for given undo record pointer.
> + *
> + * This will internally allocate the memory for the unpacked undo record which
> + * intern will

"intern" should probably be internally? But I'm not sure what the two
"internally"s really add here.

> +/*
> + * Release the memory of the undo record allocated by UndoFetchRecord and
> + * UndoBulkFetchRecord.
> + */
> +void
> +UndoRecordRelease(UnpackedUndoRecord *urec)
> +{
> + /* Release the memory of payload data if we allocated it. */
> + if (urec->uur_payload.data)
> + pfree(urec->uur_payload.data);
> +
> + /* Release memory of tuple data if we allocated it. */
> + if (urec->uur_tuple.data)
> + pfree(urec->uur_tuple.data);
> +
> + /* Release memory of the transaction header if we allocated it. */
> + if (urec->uur_txn)
> + pfree(urec->uur_txn);
> +
> + /* Release memory of the logswitch header if we allocated it. */
> + if (urec->uur_logswitch)
> + pfree(urec->uur_logswitch);
> +
> + /* Release the memory of the undo record. */
> + pfree(urec);
> +}

Those comments before each pfree are not useful.

Also, isn't this both fairly slow and fairly failure prone? The next
record is going to need all that memory again, no? It seems to me that
there should be one record that's allocated once, and then reused over
multiple fetches, increasing the size if necesssary.

I'm very doubtful that all this freeing of individual allocations in the
undo code makes sense. Shouldn't this just be done in short lived memory
contexts, that then get reset as a whole? That's both far less failure
prone, and faster.

> + * one_page - Caller is applying undo only for one block not for
> + * complete transaction. If this is set true then instead
> + * of following transaction undo chain using prevlen we will
> + * follow the block prev chain of the block so that we can
> + * avoid reading many unnecessary undo records of the
> + * transaction.
> + */
> +UndoRecInfo *
> +UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr,
> + int undo_apply_size, int *nrecords, bool one_page)

There's no caller for one_page mode in the series - I assume that's for
later, during page-wise undo? It seems to behave in quite noticably
different ways, is that really OK? Makes the code quite hard to
understand.

Also, it seems quite poorly named to me. It sounds like it's about
fetching a single undo page (which makes no sense, obviously). But what
it does is to switch to an entirely different way of traversing the undo
chains.

> + /*
> + * In one_page mode we are fetching undo only for one page instead of
> + * fetching all the undo of the transaction. Basically, we are fetching
> + * interleaved undo records. So it does not make sense to do any prefetch
> + * in that case.

What does "interleaved" mean here? I assume that there will often be
other UNDO records interspersed? But that's not guaranteed at all,
right? In fact, for a lot of workloads it seems likely that there will
be many consecutive undo records for a single page? In fact, won't that
be the majority of cases?

Thus it's not obvious to me that there's not often going to be
consecutive pages for this case too. I'd even say that minimizing IO
delay is *MORE* important during page-wise undo, as that happens in the
context of client accesses, and it's not incurring cost on the party
that performed DML, but on some random third party.

I'm doubtful this is a sane interface. There's a lot of duplication
between one_page and not one_page. It presupposes specific ways of
constructing chains that are likely to depend on the AM. to_urecptr is
only used in certain situations. E.g. I strongly suspect that for
zheap's visibility determinations we'd want to concurrently follow all
the necessary chains to determine visibility for all all tuples on the
page, far enough to find the visible tuple - for seqscan's / bitmap heap
scans / everything using page mode scans, that'll be way more efficient
than doing this one-by-one and possibly even repeatedly. But what is
exactly the right thing to do is going to be highly AM specific.

I vaguely suspect what you'd want is an interface where the "bulk fetch"
context basically has a FIFO queue of undo records to fetch, and a
function to actually perform fetching. Whenever a record has been
retrieved, a callback determines whether additional records are needed.
In the case of fetching all the undo for a transaction, you'd just queue
- probably in a more efficient representation - all the necessary
undo. In case of page-wise undo, you'd queue the first record of the
chain you'd want to undo, with a callback for queuing the next
record. For visibility determinations in zheap, you'd queue all the
different necessary chains, with a callback that queues the next
necessary record if still needed for visibility determination.

And then I suspect you'd have a separate callback whenever records have
been fetched, with all the 'unconsumed' records. That then can,
e.g. based on memory consumption, decide to process them or not. For
visibility information you'd probably just want to condense the records
to the minimum necessary (i.e. visibility information for the relevant
tuples, and the visibile tuple when encountered) as soon as available.

Obviously that's pretty handwavy.

> Also, if we are fetching undo records from more than one
> + * log, we don't know the boundaries for prefetching. Hence, we can't use
> + * prefetching in this case.
> + */

Hm. Why don't we know the boundaries (or cheaply infer them)?

> + /*
> + * If prefetch_pages are half of the prefetch_target then it's time to
> + * prefetch again.
> + */
> + if (prefetch_pages < prefetch_target / 2)
> + PrefetchUndoPages(rnode, prefetch_target, &prefetch_pages, to_blkno,
> + from_blkno, category);

Hm. Why aren't we prefetching again as soon as possible? Given the
current code there's not really benefit in fetching many adjacent pages
at once. And this way it seems we're somewhat likely to cause fairly
bursty IO?

> + /*
> + * In one_page mode it's possible that the undo of the transaction
> + * might have been applied by worker and undo got discarded. Prevent
> + * discard worker from discarding undo data while we are reading it.
> + * See detail comment in UndoFetchRecord. In normal mode we are
> + * holding transaction undo action lock so it can not be discarded.
> + */

I don't really see a comment explaining this in UndoFetchRecord. Are
you referring to InHotStandby? Because there's no comment about one_page
mode as far as I can tell? The comment is clearly referring to that,
rather than InHotStandby?

> + if (one_page)
> + {
> + /* Refer comments in UndoFetchRecord. */

Missing "to".

> + if (InHotStandby)
> + {
> + if (UndoRecPtrIsDiscarded(urecptr))
> + break;
> + }
> + else
> + {
> + LWLockAcquire(&slot->discard_lock, LW_SHARED);
> + if (slot->logno != logno || urecptr < slot->oldest_data)
> + {
> + /*
> + * The undo log slot has been recycled because it was
> + * entirely discarded, or the data has been discarded
> + * already.
> + */
> + LWLockRelease(&slot->discard_lock);
> + break;
> + }
> + }

I find this deeply unsatisfying. It's repeated in a bunch of
places. There's completely different behaviour between the hot-standby
and !hot-standby case. There's UndoRecPtrIsDiscarded for the HS case,
but we do a different test for !HS. There's no explanation as to why
this is even reachable.

> + /* Read the undo record. */
> + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer);
> +
> + /* Release the discard lock after fetching the record. */
> + if (!InHotStandby)
> + LWLockRelease(&slot->discard_lock);
> + }
> + else
> + UndoGetOneRecord(uur, urecptr, rnode, category, &buffer);

And then we do none of this in !one_page mode.

> + /*
> + * As soon as the transaction id is changed we can stop fetching the
> + * undo record. Ideally, to_urecptr should control this but while
> + * reading undo only for a page we don't know what is the end undo
> + * record pointer for the transaction.
> + */
> + if (one_page)
> + {
> + if (!FullTransactionIdIsValid(fxid))
> + fxid = uur->uur_fxid;
> + else if (!FullTransactionIdEquals(fxid, uur->uur_fxid))
> + break;
> + }
> +
> + /* Remember the previous undo record pointer. */
> + prev_urec_ptr = urecptr;
> +
> + /*
> + * Calculate the previous undo record pointer of the transaction. If
> + * we are reading undo only for a page then follow the blkprev chain
> + * of the page. Otherwise, calculate the previous undo record pointer
> + * using transaction's current undo record pointer and the prevlen. If
> + * undo record has a valid uur_prevurp, this is the case of log switch
> + * during the transaction so we can directly use uur_prevurp as our
> + * previous undo record pointer of the transaction.
> + */
> + if (one_page)
> + urecptr = uur->uur_prevundo;
> + else if (uur->uur_logswitch)
> + urecptr = uur->uur_logswitch->urec_prevurp;
> + else if (prev_urec_ptr == to_urecptr ||
> + uur->uur_info & UREC_INFO_TRANSACTION)
> + urecptr = InvalidUndoRecPtr;
> + else
> + urecptr = UndoGetPrevUndoRecptr(prev_urec_ptr, buffer, category);
> +

FWIW, this is one of those concerns I was referring to above. What
exactly needs to happen seems highly AM specific.

> +/*
> + * Read length of the previous undo record.
> + *
> + * This function will take an undo record pointer as an input and read the
> + * length of the previous undo record which is stored at the end of the previous
> + * undo record. If the undo record is split then this will add the undo block
> + * header size in the total length.
> + */

This should add some note as to when it's expected to be necessary. I
was kind of concerned that this can be necessary, but it's only needed
during log switches, which disarms that concern.

> +static uint16
> +UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> + UndoLogCategory category)
> +{
> + UndoLogOffset page_offset = UndoRecPtrGetPageOffset(urp);
> + BlockNumber cur_blk = UndoRecPtrGetBlockNum(urp);
> + Buffer buffer = input_buffer;
> + Page page = NULL;
> + char *pagedata = NULL;
> + char prevlen[2];
> + RelFileNode rnode;
> + int byte_to_read = sizeof(uint16);

Shouldn't it be byte_to_read? And the sizeof a type that's tied with the
actual undo format? Imagine we'd ever want to change the length format
for undo records - this would be hard to find.

> + char persistence;
> + uint16 prev_rec_len = 0;
> +
> + /* Get relfilenode. */
> + UndoRecPtrAssignRelFileNode(rnode, urp);
> + persistence = RelPersistenceForUndoLogCategory(category);
> +
> + if (BufferIsValid(buffer))
> + {
> + page = BufferGetPage(buffer);
> + pagedata = (char *) page;
> + }
> +
> + /*
> + * Length if the previous undo record is store at the end of that record
> + * so just fetch last 2 bytes.
> + */
> + while (byte_to_read > 0)
> + {

Why does this need a loop around the number of bytes? Can there ever be
a case where this is split across a record? If so, isn't that a bad idea
anyway?

> + /* Read buffer if the current buffer is not valid. */
> + if (!BufferIsValid(buffer))
> + {
> + buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum,
> + cur_blk, RBM_NORMAL, NULL,
> + persistence);
> +
> + LockBuffer(buffer, BUFFER_LOCK_SHARE);
> +
> + page = BufferGetPage(buffer);
> + pagedata = (char *) page;
> + }
> +
> + page_offset -= 1;
> +
> + /*
> + * Read current prevlen byte from current block if page_offset hasn't
> + * reach to undo block header. Otherwise, go to the previous block
> + * and continue reading from there.
> + */
> + if (page_offset >= UndoLogBlockHeaderSize)
> + {
> + prevlen[byte_to_read - 1] = pagedata[page_offset];
> + byte_to_read -= 1;
> + }
> + else
> + {
> + /*
> + * Release the current buffer if it is not provide by the caller.
> + */
> + if (input_buffer != buffer)
> + UnlockReleaseBuffer(buffer);
> +
> + /*
> + * Could not read complete prevlen from the current block so go to
> + * the previous block and start reading from end of the block.
> + */
> + cur_blk -= 1;
> + page_offset = BLCKSZ;
> +
> + /*
> + * Reset buffer so that we can read it again for the previous
> + * block.
> + */
> + buffer = InvalidBuffer;
> + }
> + }

I can't help but think that this shouldn't be yet another copy of logic
for how to read undo pages.

Need to do something else for a bit. More later.

Greetings,

Andres Freund

In response to

Re: POC: Cleaning up orphaned files using undo logs at 2019-07-30 08:02:20 from Andres Freund

Responses

Re: POC: Cleaning up orphaned files using undo logs at 2019-08-06 07:56:26 from Andres Freund
Re: POC: Cleaning up orphaned files using undo logs at 2019-08-13 11:35:27 from Dilip Kumar

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stephen Frost	2019-08-05 18:35:38	Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions
Previous Message	Tom Lane	2019-08-05 18:29:13	Re: [PATCH] Stop ALTER SYSTEM from making bad assumptions