Re: POC: Cleaning up orphaned files using undo logs

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2020-11-12 10:17:36
Message-ID: 92024.1605176256@antos
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Thu, Nov 28, 2019 at 3:45 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> > On Tue, Sep 17, 2019 at 10:03:20AM +1200, Thomas Munro wrote:
> > > Oops, right. So it should just be added to the if condition. Will do.
> >
> > It's been a couple of months and the discussion has stale. It seems
> > also that the patch was waiting for an update. So I am marking it as
> > RwF for now. Please feel free to update it if you feel that's not
> > adapted.
> Thanks. We decided to redesign a couple of aspects of the undo
> storage and record layers that this patch was intended to demonstrate,
> and work on that is underway. More on that soon.

As my boss expressed in his recent blog post, we'd like to contribute to the
zheap development, and a couple of developers from other companies are
interested in this as well. Amit Kapila suggested that the "cleanup of
orphaned files" feature is a good starting point for getting the code into PG
core, so I've spent some time on it and tried to rebase the patch set.

In fact, what I did is not a mere rebase onto the current master branch:
besides various bug fixes, I've also made some design changes.

Incorporated the new Undo Record Set (URS) infrastructure

This is also pointed out in [0].

I started from [1] and tried to implement some missing parts (e.g. proper
closing of the URSs after a crash), introduced the UNDO_DEBUG preprocessor
macro, which makes the undo log segments very small, and fixed some bugs that
the small segments exposed.
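
To illustrate the idea behind UNDO_DEBUG, here is a minimal sketch. Only the
macro name UNDO_DEBUG comes from the patch; the constant name, the sizes and
the helper function are made up for this example and do not match the actual
code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical illustration only: the constant name and the values below
 * are assumptions, not the ones used in the patch set.
 */
#ifdef UNDO_DEBUG
/* Tiny segments, so segment boundaries are crossed after a few records. */
#define SKETCH_UNDO_SEGMENT_SIZE ((size_t) 8 * 8192)
#else
/* A production-like segment size. */
#define SKETCH_UNDO_SEGMENT_SIZE ((size_t) 1024 * 1024)
#endif

/* Number of the segment that contains the given offset within an undo log. */
static size_t
sketch_segment_number(size_t offset)
{
    return offset / SKETCH_UNDO_SEGMENT_SIZE;
}
```

With small segments, even a short test crosses segment boundaries many times,
which is the kind of code path where the bugs mentioned above showed up.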

The most significant change I've made was removing the undo requests from the
checkpoint. I could not find any particular bug or race condition related to
including the requests in the checkpoint, but I concluded that it's easier
to think about consistency and checkpoint timings if we scan the undo log on
restart (after recovery has finished) and create the requests from scratch.

[2] shows where I ended up before I started to rebase this patchset.

No background undo

Reducing the complexity of the patch seems to be the priority at the moment.
Amit suggested that cleanup of an orphaned relation file is simple enough to
be done in the foreground, and I agree.

The "undo worker" is still there, but it only processes undo requests after
server restart. Because relation data can only be changed in a transaction, it
seems cleaner to launch a background worker for this than to hack the startup
process.

Since the concept of undo requests is closely related to the undo worker, I
removed undorequest.c too. The new (much simpler) undo worker gets the
information on incomplete / aborted transactions from the undo log as
mentioned above.

SMGR enhancement

I used the 0001 patch from [3] rather than [4], although it's more invasive,
because I noticed somewhere in the discussion that there should be no reserved
database OID for the undo log. (InvalidOid cannot be used because it's already
in use for shared catalogs.)

Components added

pg_undo_dump utility and test framework for undoread.c. BTW, undoread.c seems
to need some refactoring.

The following areas are not implemented yet because they need more
discussion:


There's no discard worker for the URS infrastructure yet. I thought about
discarding the undo log during checkpoint, but checkpoint should probably do
more straightforward tasks than the calculation of a new discard pointer for
each undo log, so a background worker is needed. A few notes on that:

* until the zheap AM gets added, only the transaction that creates the undo
records needs to access them. This assumption should make the discarding
algorithm a bit simpler. Note that with zheap, other transactions need
to look for old versions of tuples, so the oldestXidHavingUndo variable
is needed there.

* it's rather simple to pass the URS pointer to the discard worker
when the transaction has either committed or its undo has been executed. If the
URS only consists of one chunk, the discard pointer can simply be advanced
to the end of the chunk. But if there are multiple chunks, the discard
worker might need to scan quite some amount of the undo log because (IIUC)
chunks of different URSs can be interleaved (if there's not enough space
for a record in the log 1, log 2 is used, but before we get to discarding,
another transaction could have added its chunk to the log 1) and because
the chunks only contain links backwards, not forward. If we added the
forward link to the chunk header, it would make chunk closing more
complex.
How about storing the type header (which includes XID) in each chunk
instead of only the first chunk of the URS? Thus we'd be able to check for
each chunk separately whether it can be discarded.

* if the URS belongs to an aborted transaction or a transaction that could
not finish due to server crash, the transaction status alone does not
justify discarding: we also need to be sure that the underlying undo
records have been applied. So if we want to do without the
oldestXidHavingUndo variable, some sort of undo progress tracking is
needed, see below.
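
The per-chunk check proposed above could look roughly like the following
sketch. It assumes that every chunk header carries the XID and that, as long
as there is no zheap, only the creating transaction reads its own undo. All
type and function names here are hypothetical (none of them exist in the patch
set), and the transaction status lookup is a stand-in for clog plus some form
of undo progress tracking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one chunk within a single undo log. */
typedef struct SketchChunk
{
    uint32_t xid;       /* XID stored in every chunk header (the proposal) */
    uint64_t start;     /* offset of the first byte of the chunk */
    uint64_t end;       /* offset just past the last byte of the chunk */
} SketchChunk;

/* Hypothetical transaction states that matter for discarding. */
typedef enum SketchXactState
{
    SKETCH_IN_PROGRESS,
    SKETCH_COMMITTED,
    SKETCH_ABORTED_UNDO_PENDING,
    SKETCH_ABORTED_UNDO_APPLIED
} SketchXactState;

typedef SketchXactState (*SketchXactLookup) (uint32_t xid);

/* Made-up lookup used only to exercise the sketch. */
static SketchXactState
sketch_example_lookup(uint32_t xid)
{
    switch (xid)
    {
        case 100: return SKETCH_COMMITTED;
        case 101: return SKETCH_ABORTED_UNDO_APPLIED;
        case 102: return SKETCH_ABORTED_UNDO_PENDING;
        default:  return SKETCH_IN_PROGRESS;
    }
}

/* Commit alone is enough; an abort also requires the undo to be applied. */
static bool
sketch_chunk_discardable(const SketchChunk *chunk, SketchXactLookup lookup)
{
    SketchXactState state = lookup(chunk->xid);

    return state == SKETCH_COMMITTED || state == SKETCH_ABORTED_UNDO_APPLIED;
}

/*
 * Walk the chunks of one undo log (sorted by start offset) and return the
 * new discard pointer: the end of the last chunk in the leading run of
 * discardable chunks. Chunks of different URSs may be interleaved.
 */
static uint64_t
sketch_advance_discard_pointer(const SketchChunk *chunks, size_t nchunks,
                               uint64_t discard, SketchXactLookup lookup)
{
    for (size_t i = 0; i < nchunks; i++)
    {
        if (chunks[i].end <= discard)
            continue;           /* already behind the discard pointer */
        if (!sketch_chunk_discardable(&chunks[i], lookup))
            break;              /* first chunk that must be kept */
        discard = chunks[i].end;
    }
    return discard;
}
```

For example, with chunks owned by XIDs 100, 101, 100, 102, 100 (in log order),
the pointer stops at the start of the chunk of XID 102, whose undo is still
pending, even though the last chunk belongs to a committed transaction.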

Do not execute the same undo record multiple times

Although I've noticed in the zheap code that it checks whether a particular
undo action was already undone, I think this functionality fits better in the URS
layer. Also note in [1] (i.e. the undo layer, no zheap) that the header
comment of AtSubAbort_XactUndo() refers to this problem.

I've tried to implement such a thing (not included in this patch) by adding a
last_rec_applied field to UndoRecordSetChunkHeader. When the UNDO stage of the
transaction starts, this field is set to the last undo record of the given
chunk, and once that record is applied, the pointer moves to the previous
record in terms of undo pointer (i.e. the next record to be applied - the
records are applied in reverse order) and so on. For recovery purposes, the
pointer is maintained in a similar way as the ud_insertion_point field of
UndoPageHeaderData. However, although I haven't tested performance yet, I
wonder if it's o.k. to lock the buffer containing the chunk header exclusively
for each undo record execution. Is there a better place to store the progress
information, maybe at page level?
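
The progress tracking described above can be sketched as follows. This is a
simplified in-memory model, not the code I tried: records are identified by
array index instead of undo pointer, there is no buffer locking or WAL, and
apart from the last_rec_applied field name all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the chunk header. */
typedef struct SketchChunkHeader
{
    long nrecords;          /* number of undo records in the chunk */
    long last_rec_applied;  /* next record to undo; -1 once all are applied */
    bool undo_started;      /* has the UNDO stage initialized the field? */
} SketchChunkHeader;

/* At the start of the UNDO stage, point at the last record of the chunk. */
static void
sketch_start_undo(SketchChunkHeader *hdr)
{
    if (!hdr->undo_started)
    {
        hdr->last_rec_applied = hdr->nrecords - 1;
        hdr->undo_started = true;
    }
}

/*
 * Apply at most max_records records in reverse order and return how many
 * were applied. applied_counts[i] counts executions of record i, so a
 * caller can verify that no record is executed twice. A small max_records
 * simulates an interruption (e.g. a crash) in the middle of the UNDO stage.
 */
static long
sketch_apply_undo(SketchChunkHeader *hdr, long max_records,
                  long *applied_counts)
{
    long done = 0;

    sketch_start_undo(hdr);
    while (hdr->last_rec_applied >= 0 && done < max_records)
    {
        applied_counts[hdr->last_rec_applied]++;    /* "execute" the record */
        hdr->last_rec_applied--;                    /* remember the progress */
        done++;
    }
    return done;
}
```

If the first call is interrupted after three of four records, the second call
resumes at record 0 and executes only that one, so each record is applied
exactly once. In the real thing each update of the field would have to be
WAL-logged, which is where the locking concern above comes from.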

I can spend more time on this project, but I need a hint as to which part I
should focus on. Other hackers might have the same problem. Thanks for any
suggestions.


Antonin Houska

Attachment Content-Type Size
undo-20201112.tgz application/x-gzip 163.1 KB
