Re: POC: Cleaning up orphaned files using undo logs

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-13 12:02:03
Message-ID: CAFiTN-v9R4q0_Bx3_K50-9b+qvXHWdjwvike9AoufEEBoBnVKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 6, 2019 at 1:26 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2019-08-05 11:29:34 -0700, Andres Freund wrote:
> > Need to do something else for a bit. More later.
>
>
> > + /*
> > + * Compute the header size of the undo record.
> > + */
> > +Size
> > +UndoRecordHeaderSize(uint16 uur_info)
> > +{
> > + Size size;
> > +
> > + /* Add fixed header size. */
> > + size = SizeOfUndoRecordHeader;
> > +
> > + /* Add size of transaction header if it presets. */
> > + if ((uur_info & UREC_INFO_TRANSACTION) != 0)
> > + size += SizeOfUndoRecordTransaction;
> > +
> > + /* Add size of rmid if it presets. */
> > + if ((uur_info & UREC_INFO_RMID) != 0)
> > + size += sizeof(RmgrId);
> > +
> > + /* Add size of reloid if it presets. */
> > + if ((uur_info & UREC_INFO_RELOID) != 0)
> > + size += sizeof(Oid);
> > +
> > + /* Add size of fxid if it presets. */
> > + if ((uur_info & UREC_INFO_XID) != 0)
> > + size += sizeof(FullTransactionId);
> > +
> > + /* Add size of cid if it presets. */
> > + if ((uur_info & UREC_INFO_CID) != 0)
> > + size += sizeof(CommandId);
> > +
> > + /* Add size of forknum if it presets. */
> > + if ((uur_info & UREC_INFO_FORK) != 0)
> > + size += sizeof(ForkNumber);
> > +
> > + /* Add size of prevundo if it presets. */
> > + if ((uur_info & UREC_INFO_PREVUNDO) != 0)
> > + size += sizeof(UndoRecPtr);
> > +
> > + /* Add size of the block header if it presets. */
> > + if ((uur_info & UREC_INFO_BLOCK) != 0)
> > + size += SizeOfUndoRecordBlock;
> > +
> > + /* Add size of the log switch header if it presets. */
> > + if ((uur_info & UREC_INFO_LOGSWITCH) != 0)
> > + size += SizeOfUndoRecordLogSwitch;
> > +
> > + /* Add size of the payload header if it presets. */
> > + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> > + size += SizeOfUndoRecordPayload;
>
> There's numerous blocks with one if for each type, and the body copied
> basically the same for each alternative. That doesn't seem like a
> reasonable approach to me. Means that many places need to be adjusted
> when we invariably add another type, and seems likely to lead to bugs
> over time.
I think I have expressed my thought on this in another email
[https://www.postgresql.org/message-id/CAFiTN-vDrXuL6tHK1f_V9PAXp2%2BEFRpPtxCG_DRx08PZXAPkyw%40mail.gmail.com]
>
> > + /* Add size of the payload header if it presets. */
>
> FWIW, repeating the same comment, with or without minor differences, 10
> times is a bad idea. Especially when the comment doesn't add *any* sort
> of information.
Ok, fixed
>
> Also, "if it presets" presumably is a typo?
Fixed
>
>
> > +/*
> > + * Compute and return the expected size of an undo record.
> > + */
> > +Size
> > +UndoRecordExpectedSize(UnpackedUndoRecord *uur)
> > +{
> > + Size size;
> > +
> > + /* Header size. */
> > + size = UndoRecordHeaderSize(uur->uur_info);
> > +
> > + /* Payload data size. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + size += uur->uur_payload.len;
> > + size += uur->uur_tuple.len;
> > + }
> > +
> > + /* Add undo record length size. */
> > + size += sizeof(uint16);
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Calculate the size of the undo record stored on the page.
> > + */
> > +static inline Size
> > +UndoRecordSizeOnPage(char *page_ptr)
> > +{
> > + uint16 uur_info = ((UndoRecordHeader *) page_ptr)->urec_info;
> > + Size size;
> > +
> > + /* Header size. */
> > + size = UndoRecordHeaderSize(uur_info);
> > +
> > + /* Payload data size. */
> > + if ((uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + UndoRecordPayload *payload = (UndoRecordPayload *) (page_ptr + size);
> > +
> > + size += payload->urec_payload_len;
> > + size += payload->urec_tuple_len;
> > + }
> > +
> > + return size;
> > +}
> > +
> > +/*
> > + * Compute size of the Unpacked undo record in memory
> > + */
> > +Size
> > +UnpackedUndoRecordSize(UnpackedUndoRecord *uur)
> > +{
> > + Size size;
> > +
> > + size = sizeof(UnpackedUndoRecord);
> > +
> > + /* Add payload size if record contains payload data. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + size += uur->uur_payload.len;
> > + size += uur->uur_tuple.len;
> > + }
> > +
> > + return size;
> > +}
>
> These functions are all basically the same. We shouldn't copy code over
> and over like this.
UnpackedUndoRecordSize -> computes the size of the unpacked undo
record so it's different from above two, just payload part is common
so moved payload size to common function.

UndoRecordExpectedSize and UndoRecordSizeOnPage are two different
functions except for the header size computation so I already had the
common function for the header. UndoRecordExpectedSize computes the
expected record size so it can access the payload directly from the
unpack undo record whereas the UndoRecordSizeOnPage needs to calculate
the record size by the record pointer which is already stored on the
page so actually it doesn't have the unpacked undo record instead it
first need to compute the header size and then it needs to reach to
the payload data. Typecast that to payload header and compute the
length. In unpack undo record payload is stored as StringInfoData
whereas on the page it is packed as UndoRecordPayload header. So I
am not sure how to unify them. Anyway, UndoRecordSizeOnPage is
required only for undo page consistency checker patch so I have moved
out of this patch. Later, I am planning to handle the comments of the
undo page consistency checker patch so I will try to work on this
function if I can improve it.

>
>
> > +/*
> > + * Initiate inserting an undo record.
> > + *
> > + * This function will initialize the context for inserting and undo record
> > + * which will be inserted by calling InsertUndoData.
> > + */
> > +void
> > +BeginInsertUndo(UndoPackContext *ucontext, UnpackedUndoRecord *uur)
> > +{
> > + ucontext->stage = UNDO_PACK_STAGE_HEADER;
> > + ucontext->already_processed = 0;
> > + ucontext->partial_bytes = 0;
> > +
> > + /* Copy undo record header. */
> > + ucontext->urec_hd.urec_type = uur->uur_type;
> > + ucontext->urec_hd.urec_info = uur->uur_info;
> > +
> > + /* Copy undo record transaction header if it is present. */
> > + if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
> > + memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction);
> > +
> > + /* Copy rmid if present. */
> > + if ((uur->uur_info & UREC_INFO_RMID) != 0)
> > + ucontext->urec_rmid = uur->uur_rmid;
> > +
> > + /* Copy reloid if present. */
> > + if ((uur->uur_info & UREC_INFO_RELOID) != 0)
> > + ucontext->urec_reloid = uur->uur_reloid;
> > +
> > + /* Copy fxid if present. */
> > + if ((uur->uur_info & UREC_INFO_XID) != 0)
> > + ucontext->urec_fxid = uur->uur_fxid;
> > +
> > + /* Copy cid if present. */
> > + if ((uur->uur_info & UREC_INFO_CID) != 0)
> > + ucontext->urec_cid = uur->uur_cid;
> > +
> > + /* Copy undo record relation header if it is present. */
> > + if ((uur->uur_info & UREC_INFO_FORK) != 0)
> > + ucontext->urec_fork = uur->uur_fork;
> > +
> > + /* Copy prev undo record pointer if it is present. */
> > + if ((uur->uur_info & UREC_INFO_PREVUNDO) != 0)
> > + ucontext->urec_prevundo = uur->uur_prevundo;
> > +
> > + /* Copy undo record block header if it is present. */
> > + if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
> > + {
> > + ucontext->urec_blk.urec_block = uur->uur_block;
> > + ucontext->urec_blk.urec_offset = uur->uur_offset;
> > + }
> > +
> > + /* Copy undo record log switch header if it is present. */
> > + if ((uur->uur_info & UREC_INFO_LOGSWITCH) != 0)
> > + memcpy(&ucontext->urec_logswitch, uur->uur_logswitch,
> > + SizeOfUndoRecordLogSwitch);
> > +
> > + /* Copy undo record payload header and data if it is present. */
> > + if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + ucontext->urec_payload.urec_payload_len = uur->uur_payload.len;
> > + ucontext->urec_payload.urec_tuple_len = uur->uur_tuple.len;
> > + ucontext->urec_payloaddata = uur->uur_payload.data;
> > + ucontext->urec_tupledata = uur->uur_tuple.data;
> > + }
> > + else
> > + {
> > + ucontext->urec_payload.urec_payload_len = 0;
> > + ucontext->urec_payload.urec_tuple_len = 0;
> > + }
> > +
> > + /* Compute undo record expected size and store in the context. */
> > + ucontext->undo_len = UndoRecordExpectedSize(uur);
> > +}
>
> It really can't be right to have all these fields basically twice, in
> UnackedUndoRecord, and UndoPackContext. And then copy them one-by-one.
> I mean there's really just some random differences (ordering, some field
> names) between the structures, but otherwise they're the same?
>
> What on earth do we gain by this? This entire intermediate stage makes
> no sense at all to me. We copy data into an UndoRecord, then we copy
> into an UndoRecordContext, with essentially a field-by-field copy
> logic. Then we have another field-by-field logic that copies the data
> into the page.
The idea was that in UnpackedUndoRecord we have all member as a field
by field but in context, we can keep them in headers for example
UndoRecordHeader, UndoRecordGroup, UndoRecordBlock. And, the idea
behind this is that during InsertUndoData instead of calling
InsertUndoByte field by field we call it once for each header because
either we have to write all field of that header or none. But later
we end up having a lot of optional headers and most of them have just
one field in it so it appears that we are copying field by field.

One alternative could be that we palloc a memory in context and then
pack each field in that memory (except the payload and tuple data)
then in one InsertUndoByte call we can insert complete header part and
in we can have 2 more calls to InsertUndoBytes for writing payload and
tuple data. What's your thought on this.

>
>
>
>
> > +/*
> > + * Insert the undo record into the input page from the unpack undo context.
> > + *
> > + * Caller can call this function multiple times until desired stage is reached.
> > + * This will write the undo record into the page.
> > + */
> > +void
> > +InsertUndoData(UndoPackContext *ucontext, Page page, int starting_byte)
> > +{
> > + char *writeptr = (char *) page + starting_byte;
> > + char *endptr = (char *) page + BLCKSZ;
> > +
> > + switch (ucontext->stage)
> > + {
> > + case UNDO_PACK_STAGE_HEADER:
> > + /* Insert undo record header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_hd,
> > + SizeOfUndoRecordHeader, &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + ucontext->stage = UNDO_PACK_STAGE_TRANSACTION;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_TRANSACTION:
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_TRANSACTION) != 0)
> > + {
> > + /* Insert undo record transaction header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_txn,
> > + SizeOfUndoRecordTransaction,
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_RMID;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_RMID:
> > + /* Write rmid(if needed and not already done). */
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RMID) != 0)
> > + {
> > + if (!InsertUndoBytes((char *) &(ucontext->urec_rmid), sizeof(RmgrId),
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_RELOID;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_RELOID:
> > + /* Write reloid(if needed and not already done). */
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_RELOID) != 0)
> > + {
> > + if (!InsertUndoBytes((char *) &(ucontext->urec_reloid), sizeof(Oid),
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_XID;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_XID:
> > + /* Write xid(if needed and not already done). */
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_XID) != 0)
> > + {
> > + if (!InsertUndoBytes((char *) &(ucontext->urec_fxid), sizeof(FullTransactionId),
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_CID;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_CID:
> > + /* Write cid(if needed and not already done). */
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_CID) != 0)
> > + {
> > + if (!InsertUndoBytes((char *) &(ucontext->urec_cid), sizeof(CommandId),
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_FORKNUM;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_FORKNUM:
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_FORK) != 0)
> > + {
> > + /* Insert undo record fork number. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_fork,
> > + sizeof(ForkNumber),
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_PREVUNDO;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_PREVUNDO:
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PREVUNDO) != 0)
> > + {
> > + /* Insert undo record blkprev. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_prevundo,
> > + sizeof(UndoRecPtr),
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_BLOCK;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_BLOCK:
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_BLOCK) != 0)
> > + {
> > + /* Insert undo record block header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_blk,
> > + SizeOfUndoRecordBlock,
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_LOGSWITCH;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_LOGSWITCH:
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_LOGSWITCH) != 0)
> > + {
> > + /* Insert undo record transaction header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_logswitch,
> > + SizeOfUndoRecordLogSwitch,
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_PAYLOAD:
> > + if ((ucontext->urec_hd.urec_info & UREC_INFO_PAYLOAD) != 0)
> > + {
> > + /* Insert undo record payload header. */
> > + if (!InsertUndoBytes((char *) &ucontext->urec_payload,
> > + SizeOfUndoRecordPayload,
> > + &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_PAYLOAD_DATA;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_PAYLOAD_DATA:
> > + {
> > + int len = ucontext->urec_payload.urec_payload_len;
> > +
> > + if (len > 0)
> > + {
> > + /* Insert payload data. */
> > + if (!InsertUndoBytes((char *) ucontext->urec_payloaddata,
> > + len, &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_TUPLE_DATA;
> > + }
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_TUPLE_DATA:
> > + {
> > + int len = ucontext->urec_payload.urec_tuple_len;
> > +
> > + if (len > 0)
> > + {
> > + /* Insert tuple data. */
> > + if (!InsertUndoBytes((char *) ucontext->urec_tupledata,
> > + len, &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > + }
> > + ucontext->stage = UNDO_PACK_STAGE_UNDO_LENGTH;
> > + }
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_UNDO_LENGTH:
> > + /* Insert undo length. */
> > + if (!InsertUndoBytes((char *) &ucontext->undo_len,
> > + sizeof(uint16), &writeptr, endptr,
> > + &ucontext->already_processed,
> > + &ucontext->partial_bytes))
> > + return;
> > +
> > + ucontext->stage = UNDO_PACK_STAGE_DONE;
> > + /* fall through */
> > +
> > + case UNDO_PACK_STAGE_DONE:
> > + /* Nothing to be done. */
> > + break;
> > +
> > + default:
> > + Assert(0); /* Invalid stage */
> > + }
> > +}
>
> I don't understand. The only purpose of this is that we can partially
> write a packed-but-not-actually-packed record onto a bunch of pages? And
> for that we have an endless chain of copy and pasted code calling
> InsertUndoBytes()? Copying data into shared buffers in tiny increments?
>
> If we need to this, what is the whole packed record format good for?
> Except for adding a bunch of functions with 10++ ifs and nearly
> identical code?
>
> Copying data is expensive. Copying data in tiny increments is more
> expensive. Copying data in tiny increments, with a bunch of branches, is
> even more expensive. Copying data in tiny increments, with a bunch of
> branches, is even more expensive, especially when it's shared
> memory. Copying data in tiny increments, with a bunch of branches, is
> even more expensive, especially when it's shared memory, especially when
> all that shared meory is locked at once.
>
>
> > +/*
> > + * Read the undo record from the input page to the unpack undo context.
> > + *
> > + * Caller can call this function multiple times until desired stage is reached.
> > + * This will read the undo record from the page and store the data into unpack
> > + * undo context, which can be later copied to unpacked undo record by calling
> > + * FinishUnpackUndo.
> > + */
> > +void
> > +UnpackUndoData(UndoPackContext *ucontext, Page page, int starting_byte)
> > +{
> > + char *readptr = (char *) page + starting_byte;
> > + char *endptr = (char *) page + BLCKSZ;
> > +
> > + switch (ucontext->stage)
> > + {
> > + case UNDO_PACK_STAGE_HEADER:
>
> You know roughly what I'm thinking.

I have expressed my thought on this in last comment.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2019-08-13 12:11:07 Re: POC: Cleaning up orphaned files using undo logs
Previous Message Dilip Kumar 2019-08-13 11:35:27 Re: POC: Cleaning up orphaned files using undo logs