Re: POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-03-12 13:20:29
Message-ID: CA+hUKGKN42jB+ubCKru716HPtMbahdia39GwG5pLgWLMZ_y1ng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Feb 3, 2019 at 11:09 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2018-12-03 18:43:04 +1300, Thomas Munro wrote:
> > Sorry for my silence... I got stuck on a design problem with the lower
> > level undo log management code that I'm now close to having figured
> > out. I'll have a new patch soon.

Hello all,

Here's a new WIP version of this patch set. It builds on a fairly
deep stack of patches being developed by several people. As mentioned
before, it's a useful crash-test dummy for a whole stack of technology
we're working on, but it's also aiming to solve a real problem.

It currently fails in one regression test for a well understood
reason, fix on the way (see end), and there are some other stability
problems being worked on.

Here's a quick tour of the observable behaviour, having installed the
pg_buffercache and test_undorecord extensions:

==================

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE

Check if our transaction has generated undo data:

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs ;
logno | discard | insert | xid | pid
-------+------------------+------------------+-----+-------
0 | 0000000000002CD9 | 0000000000002D1A | 476 | 39169
(1 row)

Here, we see that undo log number 0 has some undo data because discard
< insert. We can find out what it says:

postgres=# call dump_undo_records(0);
NOTICE: 0000000000002CD9: Storage: CREATE dbid=12916, tsid=1663,
relfile=16386; xid=476, next xact=0
CALL

The undo record shown there lives in shared buffers, and we can see
that it's in there with pg_buffercache (the new column smgrid 1 means
undo data; 0 is regular relation data):

postgres=# select bufferid, smgrid, relfilenode, relblocknumber,
isdirty, usagecount from pg_buffercache where smgrid = 1;
bufferid | smgrid | relfilenode | relblocknumber | isdirty | usagecount
----------+--------+-------------+----------------+---------+------------
3 | 1 | 0 | 1 | t | 5
(1 row)

Even though that's just a dirty page in shared buffers, if we crash
now and recover, it'll be recreated by a new WAL record that was
flushed *before* creating the relation file. We can see that with
pg_waldump:

rmgr: Storage ... PRECREATE base/12916/16384, blkref #0: smgr 1 rel
1663/0/0 blk 1 FPW
rmgr: Storage ... CREATE base/12916/16384

The PRECREATE record dirtied block 1 of undo log 0. In this case it
happened to include a FPW of the undo log page too, following the
usual rules. FPWs are rare for undo pages because of the
REGBUF_WILL_INIT optimisation that applies to the zeroed out pages
(which is most undo pages, due to the append-mostly access pattern).

Finally, we if commit we see the undo data is discarded by a
background worker, and if we roll back explicitly or crash and run
recovery, the file is unlinked. Here's an example of the crash case:

postgres=# begin;
BEGIN
postgres=# create table foo ();
CREATE TABLE
postgres=# select relfilenode from pg_class where relname = 'foo';
relfilenode
-------------
16395
(1 row)

postgres=# select pg_backend_pid();
pg_backend_pid
----------------
39169
(1 row)

$ kill -9 39169

... server restarts, recovers ...

$ ls pgdata/base/12916/16395
pgdata/base/12916/16395

It's still there, though it's been truncated by an undo worker (see
end of email). And finally, after the next checkpoint:

$ ls pgdata/base/12916/16395
ls: pgdata/base/12916/16395: No such file or directory

That's the end of the quick tour.

Most of these patches should probably be discussed in other threads,
but I'm posting a snapshot of the full stack here anyway. Here's a
patch-by-patch summary:

=== 0001 "Refactor the fsync mechanism to support future SMGR
implementations." ===

The 0001 patch has its own CF thread
https://commitfest.postgresql.org/22/1829/ and is from Shawn Debnath
(based on earlier work by me), but I'm including a copy here for
convenience/cfbot.

=== 0002 "Add SmgrId to smgropen() and BufferTag." ===

This is new, and is based on the discussion from another recent
thread[1] about how we should identify buffers belonging to different
storage managers. In earlier versions of the patch-set I had used a
special reserved DB OID for undo data. Tom Lane didn't like that idea
much, and Anton Shyrabokau (via Shawn Debnath) suggested making
ForkNumber narrower so we can add a new field to BufferTag, and Andres
Freund +1'd my proposal to add the extra value as a parameter to
smgropen(). So, here is a patch that tries those ideas.

Another way to do this would be to widen RelFileNode instead, to avoid
having to pass around the SMGR ID separately in various places.
Looking at the number of places that have to chance, you can probably
see why we wanted to use a magic DB OID instead, and I'm not entirely
convinced that it wasn't better that way, or that I've found all the
places that need to carry an smgrid alongside a RelFileNode.

Archeological note: smgropen() was like that ~15 years ago before
commit 87bd9563, but buffer tags didn't include the SMGR ID.

I decided to call md.c's ID "SMGR_RELATION", describing what it really
holds -- regular relations -- rather than perpetuating the doubly
anachronistic "magnetic disk" name.

While here, I resurrected the ancient notion of a per-SMGR 'open'
routine, so that a small amount of md.c-specific stuff could be kicked
out of smgr.c and future implementations can do their own thing here
too.

While doing that work I realised that at least pg_rewind needs to
learn about how different storage managers map blocks to files, so
that's a new TODO item requiring more thought. I wonder what other
places know how to map { RelFileNode, ForkNumber, BlockNumber } to a
path + offset, and I wonder what to think about the fact that some of
them may be non-backend code...

=== 0003 "Add undo log manager." ===

This and the next couple of patches live in CF thread
https://commitfest.postgresql.org/22/1649/ but here's a much newer
snapshot that hasn't been posted there yet.

Manages a set of undo logs in shared memory, manages undo segment
files, tracks discard, insert, end pointers visible in
pg_stat_undo_logs. With this patch you can allocate and discard space
in undo logs using the UndoRecPtr type to refer to addresses, but
there is no access to the data yet. Improvements since the last
version are not requiring DSM segments, proper FPW support and reduced
WAL traffic. Previously there were extra per-xact and per-checkpoint
records requiring retry-loops in code that inserted undo data.

=== 0004 "Provide access to undo log data via the buffer manager." ===

Provide SMGR_UNDO. While the 0003 patch deals with allocating and
discarding undo address space and makes sure that backing files exist,
this patch lets you read and write buffered data in them.

=== 0005 "Allow WAL record data on first modification after a checkpoint." ===

Provide a way for data to be attached to a WAL-registered block that
is only included if this turns out to be the first WAL record that
touches the block after a checkpoint. This is a bit like FPW images,
except that it's arbitrary extra data and happens even if FPW is off.
This is used to capture a copy of the (tiny) undo log meta-data
(primary the insertion pointer) to fix a consistency problem when
recovering from an online checkpoint.

=== 0006 + 0007 "Provide interfaces to store and fetch undo records." ===

This is a snapshot of work by my colleagues Dilip, Rafia and others
based on earlier prototyping by Robert. While the earlier patches
give you buffered binary undo data, this patch introduces the concept
of high level undo records that can be inserted, and read back given
an UndoRecPtr. This is a version presented on another thread already;
here it's lightly changed due to rebasing by me.

Undo-aware modules should design a set of undo record types, and
insert exactly the same ones at do and undo time.

The 0007 patch is fixups from me to bring that code into line with
changes to the lower level patches. Future versions will be squashed
and tidied up; still working on that.

=== 0008 + 0009 "Undo worker and transaction rollback" ===

This has a CF thread at https://commitfest.postgresql.org/22/1828/ and
again this is a snapshot of work from Dilip, Rafia and others, with a
fixup from me. Still working on coordinating that for the next
version.

This provides a way for RMGR modules to register a callback function
that will receive all the undo records they inserted during a given
[sub]transaction if it rolls back. It also provides a system of
background workers that can execute those undo records in case the
rollback happens after crash recovery, or in case the work can be
usefully pushed into the background during a regular online rollback.
This is a complex topic and I'm not attempting to explain it here.

There are a few known problems with this and Dilip is working on a
more sophisticated worker management system, but I'll let him write
about that, over in that other thread.

I think it'd probably be a good idea to split this patch into two or
three; the RMGR undo support, the xact.c integration and the worker
machinery. But maybe that's just me.

Archeological note: XXXX_undo() callback functions registered via
rmgrlist.h a bit like this originally appeared in the work by Vadim
Mikheev (author of WAL) in commit b58c0411bad4, but that was
apparently never completed once people figured out that you can make a
force, steal, redo, no-undo database work (curiously I saw a slide
from a university lecture somewhere saying that would be impossible).
The stub functions were removed from the tree in 4c8495a1. Our new
work differs from Vadim's original vision by putting undo data in a
separate place from the WAL, and accessing it via shared buffers. I
guess that might be because Vadim planned to use undo for rollback
only, not for MVCC (but I might be wrong about that). That difference
might explains why eg Vadim's function heap_undo() took an XLogRecord,
whereas our proposal takes a different type. Our proposal also passes
more than one records at a time to the undo handler; in future this
will allow us to collect up all undo records relating to a page of
(eg) zheap, and process them together for mechanical sympathy.

=== 0010 "Add developer documentation for the undo log storage subsystem." ===

Updated based on Robert's review up-thread. No coverage of background
workers yet -- that is under development.

=== 0011 "Add user-facing documentation for undo logs." ===

Updated based on Robert's review up-thread.

=== 0012 "Add test_undorecord test module." ===

Provides quick and dirty dump_undo_records() procedure for testing.

=== 0013 "Use undo-based rollback to clean up files on abort." ===

Finally, this is the actual feature that this CF item is about. The
main improvement here is that the previous version unlinked files
immediately when executing undo actions, which broke the protocol
established by commit 6cc4451b, namely that you can't reuse a
relfilenode until after the next checkpoint, and the existence of an
(empty) first relation segment in the filesystem is the only thing
preventing that. That is fixed in this version (but see problem 2
below).

Known problems:

1. A couple of tests fail with "ERROR: buffer is pinned in
InvalidateBuffer". That's because ROLLBACK TO SAVEPOINT is executing
the undo actions that drop the buffers for a newly created table
before the subtransaction has been cleaned up. Amit is working on a
solution to that. More soon.

2. The are two levels of deferment of file unlinking in current
PostgreSQL. First, when you create a new relation, it is pushed on
pendingDeletes; this patch-set replaces that in-memory list with
persistent undo records as discussed. There is a second level of
deferment: we unlink all the segments of the file except the first
one, which we truncate, and then finally the zero-length file is
unlinked after the next checkpoint; this is an important part of
PostgreSQL's protocol for not reusing relfilenodes too soon. That
means that there is still a very narrow window after the checkpoint is
logged but before we've unlinked that file where you could still crash
and leak a zero-length file. I've thought about a couple of solutions
to close that window, including a file renaming scheme where .zombie
files get cleaned up on crash, but that seemed like something that
could be improved later.

There is something else that goes wrong under parallel make check,
which I must have introduced recently but haven't tracked down yet. I
wanted to post a snapshot version for discussion anyway. More soon.

This code is available at https://github.com/EnterpriseDB/zheap/tree/undo.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com

Attachment Content-Type Size
undo-smgr-v2.tgz application/x-gzip 153.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jesper Pedersen 2019-03-12 13:23:25 Re: speeding up planning with partitions
Previous Message Sergei Kornilov 2019-03-12 13:19:27 Re: pg_stat_statements vs. SELECT FOR UPDATE