Re: POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-07-30 06:50:29
Message-ID: CA+hUKGJz5xtLrcZE418_Xat8o7T1Wy6D5bK8ZMX1QrTqQbJT-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 30, 2019 at 5:03 PM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> On Fri, Jul 19, 2019 at 2:28 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Thu, Jul 11, 2019 at 9:17 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > On Thu, Jul 11, 2019 at 12:38 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > > > I don't like the fact that undoaccess.c has a new global,
> > > > undo_compression_info. I haven't read the code thoroughly, but do we
> > > > really need that? I think it's never modified (so it could just be
> > > > declared const),
> > >
> > > Actually, this will get modified otherwise across undo record
> > > insertion how we will know what was the values of the common fields in
> > > the first record of the page. Another option could be that every time
> > > we insert the record, read the value from the first complete undo
> > > record on the page but that will be costly because for every new
> > > insertion we need to read the first undo record of the page.
> > >
> >
> > This information won't be shared across transactions, so can't we keep
> > it in top transaction's state? It seems to me that will be better
> > than to maintain it as a global state.
>
> I think this idea is good for the DO time but during REDO time it will
> not work as we will not have the transaction state. Having said that
> the current idea of keeping in the global variable will also not work
> during REDO time because the WAL from the different transaction can be
> interleaved. There are few ideas to handle this issue
>
> 1. At DO time keep in TopTransactionState as you suggested and during
> recovery time read from the first complete record on the page.
> 2. Just to keep the code uniform always read from the first complete
> record of the page.
>
> After putting more though I am more inclined towards idea-2. Because
> we are anyway inserting our current record into that page basically we
> have read the buffer and also holds the exclusive lock on the buffer.
> So reading a few extra bytes from the buffer will not hurt us IMHO.
>
> If someone has a better solution please suggest.

Hi Dilip,

Here's some initial review of the following patch (from your public
undo_interface_v1 branch as of this morning). I haven't tested this
version yet, because my means of testing this stuff involves waiting
for undoprocessing to be rebased, so that I can test it with my
orphaned files stuff and other test programs. It contains another
suggestion for that problem you just mentioned (and also me pointing
out what you just pointed out, since I wrote it earlier) though I'm
not sure if it's better than your options above.

> commit 2f3c127b9e8bc7d27cf7adebff0a355684dfb94e
> Author: Dilip Kumar <dilipkumar(at)localhost(dot)localdomain>
> Date: Thu May 2 11:28:13 2019 +0530
>
> Provide interfaces to store and fetch undo records.

+#include "commands/tablecmds.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"

"miscadmin.h" comes before "storage...".

+/*
+ * Compute the size of the partial record on the undo page.
+ *
+ * Compute the complete record size by uur_info and variable field length
+ * stored in the page header and then subtract the offset of the record so that
+ * we can get the exact size of partial record in this page.
+ */
+static inline Size
+UndoPagePartialRecSize(UndoPageHeader phdr)
+{
+ Size size;

We decided to use size_t everywhere in new code (except perhaps
functions conforming to function pointer types that historically use
Size in their type).

+ /*
+ * Compute the header size from undo record uur_info, stored in the page
+ * header.
+ */
+ size = UndoRecordHeaderSize(phdr->uur_info);
+
+ /*
+ * Add length of the variable part and undo length. Now, we know the
+ * complete length of the undo record.
+ */
+ size += phdr->tuple_len + phdr->payload_len + sizeof(uint16);
+
+ /*
+ * Subtract the size which is stored in the previous page to get the
+ * partial record size stored in this page.
+ */
+ size -= phdr->record_offset;
+
+ return size;

This is probably a stupid question but why isn't it enough to just
store the offset of the first record that begins on this page, or 0
for none yet? Why do we need to worry about the partial record's
payload etc?

+UndoRecPtr
+PrepareUndoInsert(UndoRecordInsertContext *context,
+ UnpackedUndoRecord *urec,
+ Oid dbid)
+{
...
+ /* Fetch compression info for the transaction. */
+ compression_info = GetTopTransactionUndoCompressionInfo(category);

How can this work correctly in recovery? [Edit: it doesn't, as you
just pointed out]

I had started reviewing an older version of your patch (the version
that had made it as far as the undoprocessing branch as of recently),
before I had the bright idea to look for a newer version. I was going
to object to the global variable you had there in the earlier version.
It seems to me that you have to be able to reproduce the exact same
compression in recovery that you produced as "do" time, no? How can
TopTranasctionStateData be the right place for this in recovery?

One data structure that could perhaps hold this would be
UndoLogTableEntry (the per-backend cache, indexed by undo log number,
with pretty fast lookups; used for things like
UndoLogNumberGetCategory()). As long as you never want to have
inter-transaction compression, that should have the right scope to
give recovery per-undo log tracking. If you ever wanted to do
compression between transactions too, maybe UndoLogSlot could work,
but that'd have more complications.

+/*
+ * Read undo records of the transaction in bulk
+ *
+ * Read undo records between from_urecptr and to_urecptr until we exhaust the
+ * the memory size specified by undo_apply_size. If we could not read all the
+ * records till to_urecptr then the caller should consume current set
of records
+ * and call this function again.
+ *
+ * from_urecptr - Where to start fetching the undo records.
If we can not
+ * read all the records because of memory limit then this
+ * will be set to the previous undo record
pointer from where
+ * we need to start fetching on next call.
Otherwise it will
+ * be set to InvalidUndoRecPtr.
+ * to_urecptr - Last undo record pointer to be fetched.
+ * undo_apply_size - Memory segment limit to collect undo records.
+ * nrecords - Number of undo records read.
+ * one_page - Caller is applying undo only for one block not for
+ * complete transaction. If this is set true then instead
+ * of following transaction undo chain using
prevlen we will
+ * follow the block prev chain of the block so that we can
+ * avoid reading many unnecessary undo records of the
+ * transaction.
+ */
+UndoRecInfo *
+UndoBulkFetchRecord(UndoRecPtr *from_urecptr, UndoRecPtr to_urecptr,
+ int undo_apply_size, int *nrecords, bool one_page)

Could you please make it clear in comments and assertions what the
relation between from_urecptr and to_urecptr is and what they mean
(they must be in the same undo log, one must be <= the other, both
point to the *start* of a record, so it's not the same as the total
range of undo)?

undo_apply_size is not a good parameter name, because the function is
useful for things other than applying records -- like the
undoinspect() extension (or some better version of that), for example.
Maybe max_result_size or something like that?

+{
...
+ /* Allocate memory for next undo record. */
+ uur = palloc0(sizeof(UnpackedUndoRecord));
...
+
+ size = UnpackedUndoRecordSize(uur);
+ total_size += size;

I see, so the unpacked records are still allocated one at a time. I
guess that's OK for now. From some earlier discussion I had been
expecting an arrangement where the actual records were laid out
contiguously with their subcomponents (things they point to in
palloc()'d memory) nearby.

+static uint16
+UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
+ UndoLogCategory category)
+{
...
+ char prevlen[2];
...
+ prev_rec_len = *(uint16 *) (prevlen);

I don't think that's OK, and might crash on a non-Intel system. How
about using a union of uint16 and char[2]?

+ /* Copy undo record transaction header if it is present. */
+ if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+ memcpy(&ucontext->urec_txn, uur->uur_txn, SizeOfUndoRecordTransaction);

I was wondering why you don't use D = S instead of mempcy(&D, &S,
size) wherever you can, until I noticed you use these SizeOfXXX macros
that don't include trailing padding from structs, and that's also how
you allocate objects. Hmm. So if I were to complain about you not
using plain old assignment whenever you can, I'd also have to complain
about that.

I think that that technique of defining a SizeOfXXX macro that
excludes trailing bytes makes sense for writing into WAL or undo log
buffers using mempcy(). I'm not sure it makes sense for palloc() and
copying into typed variables like you're doing here and I think I'd
prefer the notational simplicity of using the (very humble) type
system facilities C gives us. (Some memory checker might not like it
you palloc(the shorter size) and then use = if the compiler chooses to
implement it as memcpy sizeof().)

+/*
+ * The below common information will be stored in the first undo record of the
+ * page. Every subsequent undo record will not store this information, if
+ * required this information will be retrieved from the first undo
record of the
+ * page.
+ */
+typedef struct UndoCompressionInfo

Shouldn't this say "Every subsequent record will not store this
information *if it's the same as the relevant fields in the first
record*"?

+#define UREC_INFO_TRANSACTION 0x001
+#define UREC_INFO_RMID 0x002
+#define UREC_INFO_RELOID 0x004
+#define UREC_INFO_XID 0x008

Should we call this UREC_INFO_FXID, since it refers to a FullTransactionId?

+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info. All structures are packed into the alignment without padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */

I think you mean "All structures are packed into undo pages without
considering alignment and without trailing padding bytes"? This comes
from the definition of the SizeOfXXX macros IIUC. There might still
be padding between members of some of those structs, no? Like this
one, that has the second member at offset 2 on my system:

+typedef struct UndoRecordHeader
+{
+ uint8 urec_type; /* record type code */
+ uint16 urec_info; /* flag bits */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader \
+ (offsetof(UndoRecordHeader, urec_info) + sizeof(uint16))

+/*
+ * Information for a transaction to which this undo belongs. This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+ /*
+ * Undo block number where we need to start reading the undo for applying
+ * the undo action. InvalidBlockNumber means undo applying hasn't
+ * started for the transaction and MaxBlockNumber mean undo completely
+ * applied. And, any other block number means we have applied partial undo
+ * so next we can start from this block.
+ */
+ BlockNumber urec_progress;
+ Oid urec_dbid; /* database id */
+ UndoRecPtr urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;

I propose that we rename this to UndoRecordGroupHeader (or something
like that... maybe "Set", but we also use "set" as a verb in various
relevant function names):

1. We'll also use these for the new "shared" records we recently
invented that don't relate to a transaction. This is really about
defining the unit of discarding; we throw away the whole set of
records at once, which is why it's basically about proividing a space
for "urec_next".

2. Though it also holds rollback progress information, which is a
transaction-specific concept, there can be more than one of these sets
of records for a single transaction anyway. A single transaction can
write undo stuff in more than one undo log (different categories
perm/temp/unlogged/shared and also due to log switching when they are
full).

So really it's just a header for an arbitrary set of records, used to
track when and how to discard them.

If you agree with that idea, perhaps urec_next should become something
like urec_next_group, too. "next" is a bit vague, especially for
something as untyped as UndoRecPtr: someone might think it points to
the next record.

More soon.

--
Thomas Munro
https://enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2019-07-30 07:20:53 Re: partition routing layering in nodeModifyTable.c
Previous Message Michael Paquier 2019-07-30 06:42:39 Re: allow online change primary_conninfo