Re: POC: Cleaning up orphaned files using undo logs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-14 06:57:45
Message-ID: 20190814065745.2faw3hirvfhbrdwe@alap3.anarazel.de

Hi,

On 2019-08-06 14:18:42 -0700, Andres Freund wrote:
> Here's the last section of my low-level review. Plan to write a higher
> level summary afterwards, now that I have a better picture of the code.

General comments:

- For a feature of this complexity, there's very little architectural
documentation. Some individual parts have a bit, but there's basically
nothing over-arching. That makes it extremely hard for anybody who is
not already involved to understand the design constraints, and it is
hard to understand even for people who are involved.

I think it's very important for this to have a document that first
explains what the goals, and non-goals, of this feature are, and then
explains the chosen architecture with reference to those
constraints. Once that's there it's a lot easier to review this
patchset, to discuss the overall architecture, etc.

- There are too many different email threads and git branches. The same
patches are discussed in different threads, and layers exist in slightly
diverging versions in different git trees. That again makes it very hard
for anybody not primarily focusing on undo to join the discussion.

I think most of the older git branches should be renamed to
something indicating their historical status. The remaining branches
should be referenced from a wiki page (linked to in each submission of
a new patch version), explaining what they're about. I don't think
it's realistic to have people infer meaning from the current branch
names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2,
undo_interface_v1, undoprocessing).

Given the size of the overall project it's quite possibly not
realistic to manage the whole work in a single git branch. With
separate git branches, as currently done, it's however hard to
understand which version of which layer is used. I think at the very
least higher layers need to indicate which version of the underlying
layers is being used. I suggest just adding a "v01: " style prefix to
the subjects of all commits a branch is rebased onto.

It's also currently hard to understand what version of a layer is
being discussed. I think submissions all need to include a version
number (cf. git format-patch's -v option), and that version ought to
be included in the subject line. Each new major version of a patch
should be started as a reply to the first message of a thread, to
keep the structure of the discussion in a manageable shape. New versions
should include explicit explanations of the major changes compared
to the last version.

- The code quality of pretty significant parts of the patchset is not
even close to being good enough. There are areas with a lot of code
duplication. There are very few higher level explanations of
interfaces. There are a lot of "i++; /* increment i to increment it */"
style comments, but not enough higher level comments. There are
significant issues with parts of the code that aren't noted anywhere
in comments, leading to reviewers having to repeatedly re-discover
them (and wasting time doing so).

There are different naming styles in related code without a discernible
pattern (e.g. UndoRecordSetInfo being followed by
get_undo_rec_cid_offset). The word ordering of different layers is
confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs
PrepareUndoInsert). Different important externally visible functions
have names that don't make it possible to determine which is supposed
to do what (PrepareUndoInsert vs BeginUndoRecordInsert).

More specific comments:

- The whole sequencing of undo record insertion in combination with WAL
logging does not appear to be right. It's a bit hard to say, because
there's very little documentation on what the intended default
sequence of operations is.

My understanding is that the currently expected pattern is to:

1) Collect information / perform work needed to perform the action
that needs to be UNDO logged. E.g. perform visibility
determinations, wait for lockers, compute new infomask, etc.

This will likely end with the "content" page(s) (e.g. a table's
page) being exclusively locked.

2) Estimate space needed for all UNDO logging (BeginUndoRecordInsert)

3) Prepare each undo record using PrepareUndoInsert(); this includes
building the content for each record. This acquires, pins and locks
buffers for undo.

4) Begin a critical section

5) Modify the content page, mark buffer dirty

6) Write UNDO, using InsertPreparedUndo()

7) Associate undo with the WAL record (RegisterUndoLogBuffers)

8) Write WAL

9) End the critical section
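
Spelled out as a compile-only sketch (the stubs are placeholders -- the
functions in the patchset obviously take arguments and return values
that I'm not reproducing here; only the ordering matters):

    /* stand-ins so the sketch compiles; not the real definitions */
    #define START_CRIT_SECTION()    ((void) 0)
    #define END_CRIT_SECTION()      ((void) 0)
    static void BeginUndoRecordInsert(void) {}   /* 2) size estimation */
    static void PrepareUndoInsert(void) {}       /* 3) build record, pin+lock undo buffers */
    static void modify_content_page(void) {}     /* 5) the actual "table" page change */
    static void InsertPreparedUndo(void) {}      /* 6) copy records into undo buffers */
    static void RegisterUndoLogBuffers(void) {}  /* 7) attach undo buffers to the WAL record */
    static void write_wal_record(void) {}        /* 8) XLogInsert() in reality */

    static void
    sample_undo_logged_operation(void)
    {
        /* 1) visibility checks, locking the content page, etc. */
        BeginUndoRecordInsert();        /* 2 */
        PrepareUndoInsert();            /* 3, once per undo record */
        START_CRIT_SECTION();           /* 4 */
        modify_content_page();          /* 5, and mark the buffer dirty */
        InsertPreparedUndo();           /* 6 */
        RegisterUndoLogBuffers();       /* 7 */
        write_wal_record();             /* 8 */
        END_CRIT_SECTION();             /* 9 */
    }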

But despite reading through the code, including READMEs, I'm doubtful
that's quite the intended pattern. It REALLY can't be right that one
needs to parse many function headers to figure out how the most basic
use of undo could possibly work. There needs to be very clear
documentation about how to write undo records. Functions that sound
like they're involved need actually useful comments, rather than just
restatements of their function names (cf. RegisterUndoLogBuffers,
UndoLogBuffersSetLSN, UndoLogRegister).

- I think there are two fairly fundamental, and related, problems with
the sequence outlined above:

- We can't search for target buffers to store undo data while holding
the "table" page content locked. That can involve writing out
multiple pages until we find a usable victim buffer, which can take a
pretty long time. While that's happening the content page would
currently be locked. Note how e.g. heapam.c is careful not to hold
*any* content locks while potentially performing IO. I think the
current interface makes that hard.

The easy way to solve this would be to require sites performing UNDO
logging to acquire victim pages before even acquiring other content
locks. Perhaps the better approach would be for the undo layer to
hold onto a number of clean buffers, and to keep the last page of an
already-written-to undo log pinned.

- We can't search for victim buffers for further undo data while
already holding content locks on other undo pages. Doing so means that
while we're e.g. doing IO to clean out the new page, old undo data
on the previous page can't be read.

This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
pinning and locking buffers, it'd need to be split into two
operations: one that acquires buffers and pins them, and one that
locks them. I think it's quite possible that the locking operation
could just be delayed until InsertPreparedUndo(). But if we solve
the above problem, most of this might already be solved.
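
To sketch the split (the function names here are invented; the point
is only the separation of pinning from locking):

    /* hypothetical names, not from the patchset */

    /*
     * Find target undo pages for the prepared records, performing any
     * victim-buffer IO now, and pin them.  Safe to call before any
     * content page is exclusively locked.
     */
    extern void UndoAcquireAndPinBuffers(void);

    /*
     * Lock the previously pinned undo buffers.  Cheap, so it can be
     * deferred until just before InsertPreparedUndo(), inside the
     * critical section.
     */
    extern void UndoLockPreparedBuffers(void);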

- To me the current split between the packed and unpacked UNDO record
formats makes very little sense: the reasoning behind having both is
poorly documented, if at all; it results in extremely verbose code; and
it isn't extensible.

When preparing to insert an undo record the in-buffer size is computed
with UndoRecordHeaderSize() (which needs to know about all optional data)
from within PrepareUndoInsert() (which has a bunch of additional
knowledge about the record format). Then during insertion
InsertPreparedUndo() first copies the UnpackedUndoRecord into an
UndoPackContext (again needing ...), and then, via InsertUndoData(),
copies that in small increments into the respective buffers (again
needing knowledge about the complete record format, two copies
even). Besides the code duplication, that also means the memory copies
are very inefficient, because they're all done in tiny increments,
multiple times.

When reading undo it's similar: UnpackUndoData(), again in small
chunks, reads the buffer data into an UndoPackContext (another full
copy of the unpacked record format). But then FinishUnpackUndo()
*again* copies all that data, into an actual UnpackedUndoRecord
(again with a copy of the record format, albeit a slightly different
looking one).

I'm not convinced by Heikki's argument that we shouldn't have
structure within undo records. In my opinion that is a significant
weakness of how WAL was initially designed, and it is still a problem
even after Heikki's work. But this isn't the right design either.

Even if we were to stay with the current fixed record format, I think
the current code needs a substantial redesign:

- I think 'packing' during insertion needs to serialize into a char*
allocation during PrepareUndoInsert, computing the size in parallel
(or perhaps in InsertPreparedUndo, but probably not). The size of
the record should never be split across page boundaries
(i.e. we'll leave that space unused if we otherwise would need to
split the size). The actual insertion would be a maximally sized
memcpy() (so we'd do one memcpy per buffer the record spans, rather
than one for each sub-type of a record).

That would allow us to remove most of the duplicated knowledge of the
record format, and make insertions faster (by doing only large memcpys
while holding exclusive content locks).
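
In toy form, the insertion step would then look something like this
(ignoring buffer locking, undo page headers and the real page layout;
PAGE_USABLE and the flat pages array are made up):

    #include <stddef.h>
    #include <string.h>

    #define PAGE_USABLE 8150    /* made-up usable bytes per undo page */

    /*
     * Copy an already-serialized record (built during prepare) into
     * consecutive undo pages: one memcpy per page touched, instead of
     * one per sub-field of the record.
     */
    static void
    insert_serialized(char pages[][PAGE_USABLE], size_t first_page,
                      size_t page_off, const char *rec, size_t len)
    {
        size_t page = first_page;

        while (len > 0)
        {
            size_t chunk = PAGE_USABLE - page_off;

            if (chunk > len)
                chunk = len;
            memcpy(pages[page] + page_off, rec, chunk);
            rec += chunk;
            len -= chunk;
            page++;             /* continue at the start of the next page */
            page_off = 0;
        }
    }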

- When reading an undo record, the whole stage of UnpackUndoData()
reading data into the UndoPackContext would be omitted, reading directly
into the UnpackedUndoRecord. That removes one further copy of the
record format.

- To avoid having separate copies of the record format logic, I'd
probably encode it into *one* array of metadata. If we had
{offsetof(UnpackedUndoRecord, member),
membersize(UnpackedUndoRecord, member),
flag}
we could fairly trivially remove most knowledge of the record format
from the places that currently have it.
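
E.g. something along these lines (toy struct, toy flag values and a
hypothetical descriptor table, just to show the shape I have in mind):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* toy stand-in, not the real UnpackedUndoRecord */
    typedef struct ToyUndoRecord
    {
        uint8_t  type;
        uint16_t info;
        uint32_t xid;       /* optional, governed by flag 0x01 */
        uint64_t blkno;     /* optional, governed by flag 0x02 */
    } ToyUndoRecord;

    typedef struct UndoFieldDesc
    {
        size_t  offset;     /* offsetof(ToyUndoRecord, member) */
        size_t  size;       /* size of that member */
        uint8_t flag;       /* presence flag, 0 = always present */
    } UndoFieldDesc;

    static const UndoFieldDesc undo_field_descs[] =
    {
        {offsetof(ToyUndoRecord, type),  sizeof(uint8_t),  0},
        {offsetof(ToyUndoRecord, info),  sizeof(uint16_t), 0},
        {offsetof(ToyUndoRecord, xid),   sizeof(uint32_t), 0x01},
        {offsetof(ToyUndoRecord, blkno), sizeof(uint64_t), 0x02},
    };

    /* packing (and, symmetrically, unpacking) becomes one generic loop */
    static size_t
    pack_record(const ToyUndoRecord *rec, uint8_t present, char *out)
    {
        size_t  n = 0;

        for (size_t i = 0;
             i < sizeof(undo_field_descs) / sizeof(undo_field_descs[0]);
             i++)
        {
            const UndoFieldDesc *d = &undo_field_descs[i];

            if (d->flag != 0 && (present & d->flag) == 0)
                continue;   /* optional field not present in this record */
            memcpy(out + n, (const char *) rec + d->offset, d->size);
            n += d->size;
        }
        return n;
    }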

I have some vague ideas for how to specify the format in a way that is
more extensible, but with more structure than just a blob of data. But
I don't think they're there yet.

- The interface to read undo also doesn't seem right to me. For one,
there are various different ways to read records, with associated code
duplication (batch, batch in "one page" mode - though I think that's
being removed now - and single record mode).

I think the batch mode is too restrictive. We might not need this
during the first merged version, but I think before long we're going
to want to be able to efficiently traverse all the undo chains we need
to determine the visibility of all tuples on a page. Otherwise we'll
cause a lot of additional synchronous read IO, and will repeatedly
re-fetch information, especially during sequential scans for an older
snapshot. I think I briefly outlined this in an earlier email - my
current thought is that the batch interface (which the non-batch
interface should just be a tiny helper around) should basically be a
queue of "to-be-fetched" undo records. When batch-reading an entire
transaction, all blocks get put onto that queue. When traversing
multiple chains, the chains are processed in a breadth-first fashion
(by just looking at the queue, and pushing additional work to the
end). That allows prefetch requests to be issued efficiently for
blocks to be read in the near future.
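
In toy form (prefetch_block()/read_block() are placeholders, and the
real queue would hold undo record pointers rather than bare block
numbers):

    #include <stddef.h>
    #include <stdint.h>

    #define FETCH_QUEUE_SIZE 64

    typedef struct UndoFetchQueue
    {
        uint64_t    blocks[FETCH_QUEUE_SIZE];
        size_t      head;       /* next entry to read synchronously */
        size_t      tail;       /* next free slot */
    } UndoFetchQueue;

    static void prefetch_block(uint64_t blkno) { (void) blkno; }
    static void read_block(uint64_t blkno) { (void) blkno; }

    /* issue the prefetch as soon as we know we'll need the block */
    static int
    fetch_queue_push(UndoFetchQueue *q, uint64_t blkno)
    {
        if (q->tail - q->head >= FETCH_QUEUE_SIZE)
            return 0;           /* full, caller drains first */
        prefetch_block(blkno);
        q->blocks[q->tail % FETCH_QUEUE_SIZE] = blkno;
        q->tail++;
        return 1;
    }

    /*
     * Pop the oldest entry and read it; by now the prefetch has
     * hopefully completed.  Traversing multiple undo chains
     * breadth-first just means: when processing a popped record, push
     * the block of its predecessor onto the end of the queue.
     */
    static int
    fetch_queue_pop(UndoFetchQueue *q, uint64_t *blkno)
    {
        if (q->head == q->tail)
            return 0;           /* empty */
        *blkno = q->blocks[q->head % FETCH_QUEUE_SIZE];
        q->head++;
        read_block(*blkno);
        return 1;
    }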

I think that batch reading should just copy the underlying data into a
char* buffer. Only the records that are currently being used by
higher layers should get exploded into an unpacked record. That will
reduce memory usage quite noticeably (and I suspect it will also
drastically reduce the overhead due to a large context with a lot of
small allocations that then get individually freed). It will make the
sorting of undo a bit less CPU efficient, because individual records
will need to be partially unpacked for comparison, but I think that's
going to be a far smaller loss than the win.

- My reading of the current xact.c integration is that it's not workable
as is. Undo is executed outside of a valid transaction state,
exceptions aren't properly undone, logic would need to be duplicated
to a significant degree, and it introduces a new kind of critical
section.

Greetings,

Andres Freund
