Quick Links

Re: POC: Cleaning up orphaned files using undo logs

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: POC: Cleaning up orphaned files using undo logs
Date:	2019-04-19 16:43:24
Message-ID:	CA+TgmobYhmeDHf-+APMGHmG9R+3BDLZNDOmbA8LLytAyHvB63g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Apr 19, 2019 at 6:16 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> Currently, undo branch[1] contain an older version of the (undo
> interface + some fixup). Now, I have merged the latest changes from
> the zheap branch[2] to the undo branch[1]
> which can be applied on top of the undo storage commit[3]. For
> merging those changes, I need to add some changes to the undo log
> storage patch as well for handling the multi log transaction. So I
> have attached two patches, 1) improvement to undo log storage 2)
> complete undo interface patch which include 0006+0007 from undo
> branch[1] + new changes on the zheap branch.

Some review comments:

+#define AtAbort_ResetUndoBuffers() ResetUndoBuffers()

I don't think this really belongs in xact.c. Seems like we ought to
declare it in the appropriate header file. Perhaps we also ought to
consider using a static inline function rather than a macro, although
I guess it doesn't really matter.

+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+ UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+ UndoPersistence upersistence = log->meta.persistence;

Could we arrange things so that the caller passes the persistence
level, instead of having to recalculate it here? This doesn't seem to
be a particularly cheap operation. UndoLogGet() will call
get_undo_log() which I guess will normally just do a hash table lookup
but which in the worst case will take an LWLock and perform a linear
search of a shared memory array. But at PrepareUndoInsert time we
knew the persistence level already, so I don't see why
InsertPreparedUndo should have to force it to be looked up all over
again -- and while in a critical section, no less.

Another thing that is strange about this patch is that it adds these
start_urec_ptr and latest_urec_ptr arrays and then uses them for
absolutely nothing. I think that's a pretty clear sign that the
division of this work into multiple patches is not correct. We
shouldn't have one patch that tracks some information that is used
nowhere for anything and then another patch that adds a user of that
information -- the two should go together.

Incidentally, wouldn't StartTransaction() need to reinitialize these fields?

+ * When the undorecord for a transaction gets inserted in the next log then we

undo record

+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new logs location. We will also keep

This appears to be nonsensical. You're saying that you add a
transaction header to the new log and then update it with its own
location. That can't be what you mean.

+ * Incase, this is not the first record in new log (aka new log already

"Incase," -> "If"
"aka" -> "i.e."

Overall this whole paragraph is a bit hard to understand.

+ * same transaction spans across multiple logs depending on which log is

delete "across"

+ * processed first by the discard worker. If it processes the first log which
+ * contains the transactions first record, then it can get the last record
+ * of that transaction even if it is in different log and then processes all
+ * the undo records from last to first. OTOH, if the next log get processed

Not sure how that would work if the number of logs is >2.

This whole paragraph is also hard to understand.

+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int buffer_idx;

This is a file-global variable with a very generic name that is very
similar to a local variable name used by multiple functions within the
file (bufidx) and no comments. Ugh.

+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)

The locking regime for this function is really confusing. It requires
that the caller hold discard_lock on entry, and on exit the lock will
still be held if the return value is true but will no longer be held
if the return value is false. Yikes! Anybody who reads code that
uses this function is not going to guess that it has such strange
behavior. I'm not exactly sure how to redesign this, but I think it's
not OK the way you have it. One option might be to inline the logic
into each call site.

+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer. This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo. This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)

It bugs me that this function goes back in to reacquire the discard
lock for the purpose of preventing a concurrent undo discard.
Apparently, if the other transaction's undo has been discarded between
the prepare phase and where we are now, we're OK with that and just
exit without doing anything; otherwise, we update the previous
transaction header. But that seems wrong. When we enter a critical
section, I think we should aim to know exactly what modifications we
are going to make within that critical section.

I also wonder how the concurrent discard could really happen. We must
surely be holding exclusive locks on the relevant buffers -- can undo
discard really discard undo when the relevant buffers are x-locked?

It seems to me that remaining_bytes is a crock that should be ripped
out entirely, both here and in InsertUndoRecord. It seems to me that
UndoRecordUpdateTransInfo can just contrive to set remaining_bytes
correctly. e.g.

do
{
// stuff
if (!BufferIsValid(buffer))
{
Assert(InRecovery);
already_written += (BLCKSZ - starting_byte);
done = (already_written >= undo_len);
}
else
{
page = BufferGetPage(buffer);
done = InsertUndoRecord(...);
MarkBufferDirty(buffer);
}
} while (!done);

InsertPreparedUndo needs similar treatment.

To make this work, I guess the long string of assignments in
InsertUndoRecord will need to be made unconditional, but that's
probably pretty cheap. As a fringe benefit, all of those work_blah
variables that are currently referencing file-level globals can be
replaced with something local to this function, which will surely
please the coding style police.

+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, FullTransactionId fxid,

xid vs fxid

+ urec->uur_xidepoch = EpochFromFullTransactionId(fxid);

We need to make some decisions about how we're going to handle 64-bit
XIDs vs. 32-bit XIDs in undo. This doesn't look like a well-considered
scheme. In general, PrepareUndoInsert expects the caller to have
populated the UnpackedUndoRecord, but here, uur_xidepoch is getting
overwritten with the high bits of the caller-specified XID. The
low-order bits aren't stored anywhere by this function, but the caller
is presumably expected to have placed them inside urec->uur_xid. And
it looks like the low-order bits (urec->uur_xid) get stored for every
undo record, but the high-order bits (urec->xidepoch) only get stored
when we emit a transaction header. This all seems very confusing.

I would really like to see us replace uur_xid and uur_xidepoch with a
FullTransactionId; now that we have that concept, it seems like bad
practice to break what is really a FullTransactionId into two halves
and store them separately. However, it would be a bit unfortunate to
store an additional 4 bytes of mostly-static data in every undo
record. What if we went the other way? That is, remove urec_xid
from UndoRecordHeader and from UnpackedUndoRecord. Make it the
responsibility of whoever is looking up an undo record to know which
transaction's undo they are searching. zheap, at least, generally
does know this: if it's starting from a page, then it has the XID +
epoch available from the transaction slot, and if it's starting from
an undo record, you need to know the XID for which you are searching,
I guess from uur_prevxid.

I also think that we need to remove uur_prevxid. That field does not
seem to be properly a general-purpose part of the undo machinery, but
a zheap-specific consideration. I think it's job is to tell you which
transaction last modified the current tuple, but zheap can put that
data in the payload if it likes. It is a waste to store it in every
undo record, because it's not needed if the older undo has already
been discarded or if the operation is an insert.

+ * Insert a previously-prepared undo record. This will write the actual undo

Looks like this now inserts all previously-prepared undo records
(rather than just a single one).

+ * in page. We start writting immediately after the block header.

Spelling.

+ * Helper function for UndoFetchRecord. It will fetch the undo record pointed
+ * by urp and unpack the record into urec. This function will not release the
+ * pin on the buffer if complete record is fetched from one buffer, so caller
+ * can reuse the same urec to fetch the another undo record which is on the
+ * same block. Caller will be responsible to release the buffer inside urec
+ * and set it to invalid if it wishes to fetch the record from another block.
+ */
+UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+ UndoPersistence persistence)

I don't really understand why uur_buffer is part of an
UnpackedUndoRecord. It doesn't look like the other fields, which tell
you about the contents of an undo record that you have created or that
you have parsed. Instead, it's some kind of scratch space for keeping
track of a buffer that we're using in the process of reading an undo
record. It looks like it should be an argument to UndoGetOneRecord()
and ResetUndoRecord().

I also wonder whether it's really a good design to make the caller
responsible for invalidating the buffer before accessing another
block. Maybe it would be simpler to have this function just check
whether the buffer it's been given is the right one; if not, unpin it
and pin the new one instead. But I'm not really sure...

+ /* If we already have a buffer pin then no need to allocate a new one. */
+ if (!BufferIsValid(buffer))
+ {
+ buffer = ReadBufferWithoutRelcache(SMGR_UNDO,
+ rnode, UndoLogForkNum, cur_blk,
+ RBM_NORMAL, NULL,
+ RelPersistenceForUndoPersistence(persistence));
+
+ urec->uur_buffer = buffer;
+ }

I think you should move this code inside the loop that follows. Then
at the bottom of that loop, instead of making a similar call, just set
buffer = InvalidBuffer. Then when you loop around it'll do the right
thing and you'll need less code.

Notice that having both the local variable buffer and the structure
member urec->uur_buffer is actually making this code more complex. You
are already setting urec->uur_buffer = InvalidBuffer when you do
UnlockReleaseBuffer(). If you didn't have a separate 'buffer'
variable you wouldn't need to clear them both here. In fact I think
what you should have is an argument Buffer *curbuf, or something like
that, and no uur_buffer at all.

+ /*
+ * If we have copied the data then release the buffer, otherwise, just
+ * unlock it.
+ */
+ if (is_undo_rec_split)
+ UnlockReleaseBuffer(buffer);
+ else
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

Ugh. I think what's going on here is: UnpackUndoRecord copies the
data if it switches buffers, but not otherwise. So if the record is
split, we can release the lock and pin, but otherwise we have to keep
the pin to avoid having the data get clobbered. But having this code
know about what UnpackUndoRecord does internally seems like an
abstraction violation. It's also probably not right if we ever want
to fetch undo records in bulk, as I see that the latest code in zheap
master does. I think we should switch UnpackUndoRecord over to always
copying the data and just avoid all this.

(To make that cheaper, we may want to teach UnpackUndoRecord to store
data into scratch space provided by the caller rather than using
palloc to get its own space, but I'm not actually sure that's (a)
worth it or (b) actually better.)

[ Skipping over some things ]

+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+ int *already_decoded, bool header_only)

I think we should split this function into three functions that use a
context object, call it say UnpackUndoContext. The caller will do:

BeginUnpackUndo(&ucontext); // just once
UnpackUndoData(&ucontext, page, starting_byte); // any number of times
FinishUnpackUndo(&uur, &ucontext); // just once

The undo context will store an enum value that tells us the "stage" of decoding:

- UNDO_DECODE_STAGE_HEADER: We have not yet decoded even the record
header; we need to do that next.
- UNDO_DECODE_STAGE_RELATION_DETAILS: The next thing to be decoded is
the relation details, if present.
- UNDO_DECODE_STAGE_BLOCK: The next thing to be decoded is the block
details, if present.
- UNDO_DECODE_STAGE_TRANSACTION: The next thing to be decoded is the
transaction details, if present.
- UNDO_DECODE_STAGE_PAYLOAD: The next thing to be decoded is the
payload details, if present.
- UNDO_DECODE_STAGE_DONE: Decoding is complete.

It will also store the number of bytes that have been already been
copied as part of whichever stage is current. A caller who wants only
part of the record can stop when ucontext.stage > desired_stage; e.g.
the current header_only flag corresponds to stopping when
ucontext.stage > UNDO_DECODE_STAGE_HEADER, and the potential
optimization mentioned in UndoGetOneRecord could be done by stopping
when ucontext.stage > UNDO_DECODE_STAGE_BLOCK (although I don't know
if that's worth doing).

In this scheme, BeginUnpackUndo just needs to set the stage to
UNDO_DECODE_STAGE_HEADER and the number of bytes copied to 0. The
context object contains all the work_blah things (no more global
variables!), but BeginUnpackUndo does not need to initialize them,
since they will be overwritten before they are examined. And
FinishUnpackUndo just needs to copy all of the fields from the
work_blah things into the UnpackedUndoRecord. The tricky part is
UnpackUndoData itself, which I propose should look like a big switch
where all branches fall through. Roughly:

switch (ucontext->stage)
{
case UNDO_DECODE_STAGE_HEADER:
if (!ReadUndoBytes(...))
return;
stage = UNDO_DECODE_STAGE_RELATION_DETAILS;
/* FALLTHROUGH */
case UNDO_DECODE_STAGE_RELATION_DETAILS:
if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
{
if (!ReadUndoBytes(...))
return;
}
stage = UNDO_DECODE_STAGE_BLOCK;
/* FALLTHROUGH */
etc.

ReadUndoBytes would need some adjustments in this scheme; it wouldn't
need my_bytes_read any more since it would only get called for
structures that are not yet completely read. (Regardless of whether
we adopt this idea, the nocopy flag to ReadUndoBytes appears to be
unused and can be removed.)

We could use a similar context object for InsertUndoRecord.
BeginInsertUndoRecord(&ucontext, &uur) would initialize all of the
work_blah structures within the context object. InsertUndoData will
be a big switch. Maybe no need for a "finish" function here. There
can also be a SkipInsertingUndoData function that can be called
instead of InsertUndoData if the page is discarded. I think this
would be more elegant than what we've got now.

This is not a complete review, but I'm out of time and energy for the moment...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: POC: Cleaning up orphaned files using undo logs at 2019-04-19 10:15:43 from Dilip Kumar

Responses

Re: POC: Cleaning up orphaned files using undo logs at 2019-04-25 11:15:44 from Dilip Kumar

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2019-04-19 16:47:39	Re: TM format can mix encodings in to_char()
Previous Message	Justin Pryzby	2019-04-19 14:43:01	Re: clean up docs for v12