POC: Cleaning up orphaned files using undo logs

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: POC: Cleaning up orphaned files using undo logs
Date: 2018-11-01 03:22:51
Message-ID: CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

The following sequence creates an orphaned file:

BEGIN;
CREATE TABLE t ();
<kill -9 this backend>

Occasionally there are reports of systems that have managed to produce
a lot of them, perhaps through ENOSPC-induced panics, OOM signals or
buggy/crashing extensions etc. The most recent example I found in the
archives involved 1.7TB of unexpected files and some careful cleanup
work.

Relation files are created eagerly, and rollback is handled by pushing
PendingRelDelete objects onto the pendingDeletes list, to be discarded
on commit or processed on abort. That's effectively a kind of
specialised undo log, but it's in memory only, so it's less persistent
than the effects it is supposed to undo.

Here's a proof-of-concept patch that plugs the gap using the undo log
technology we're developing as part of the zHeap project. Note that
zHeap is not involved here: the SMGR module is acting as a direct
client of the undo machinery. Example:

postgres=# begin;
BEGIN
postgres=# create table t1 ();
CREATE TABLE
postgres=# create table t2 ();
CREATE TABLE

... now we can see that this transaction has some undo data (discard < insert):

postgres=# select logno, discard, insert, xid, pid from pg_stat_undo_logs;
logno | discard | insert | xid | pid
-------+------------------+------------------+-----+-------
0 | 00000000000021EF | 0000000000002241 | 581 | 18454
(1 row)

... and, if the test_undorecord module is installed, we can inspect
the records it holds:

postgres=# call dump_undo_records(0);
NOTICE: 0000000000002224: Storage: CREATE dbid=12655, tsid=1663, relfile=24594
NOTICE: 00000000000021EF: Storage: CREATE dbid=12655, tsid=1663, relfile=24591
CALL

If we COMMIT, the undo data is discarded by advancing the discard
pointer (tail) to match the insert pointer (head). If we ROLLBACK,
either explicitly or automatically by crashing and recovering, then
the files will be unlinked and the insert pointer will be rewound;
either way the undo log eventually finishes up "empty" again (discard
== insert). This is done with a system of per-rmgr-ID record types
and callbacks, similar to redo. The rollback action are either
executed immediately or offloaded to an undo worker process, depending
on simple heuristics.

Of course this isn't free, and the current patch makes table creation
slower. The goal is to make sure that there is no scenario (kill -9,
power cut etc) in which there can be a new relation file on disk, but
not a corresponding undo record that would unlink that file if the
transaction later has to roll back. Currently, that means that we
need to flush the WAL record that will create the undo record that
will unlink the file *before* we create the relation file. I suspect
that could be mitigated quite easily, by deferring file creation in a
backend-local queue until forced by access or commit. I didn't try to
do that in this basic version.

There are probably other ways to solve the specific problem of
orphaned files, but this approach is built on a general reusable
facility and I think it is a nice way to show the undo concepts, and
how they are separate from zheap. Specifically, it demonstrates the
more traditional of the two uses for undo logs: a reliable way to
track actions that must be performed on rollback. (The other use is:
seeing past versions of data, for vacuumless MVCC; that's a topic for
later).

Patches 0001-0006 are development snapshots of material posted on
other threads already[1][2], hacked around by me to make this possible
(see those threads for further developments in those patches including
some major strengthening work, coming soon). The subject of this
thread is 0007, the core of which is just a couple of hundred lines
written by me, based on an idea from Robert Haas.

Personally I think it'd be a good feature to get into PostgreSQL 12,
and I will add it to the CF that is about to start to seek feedback.
It passes make check on Unix and Windows, though currently it's
failing some of the TAP tests for reasons I'm looking into (possibly
due to bugs in the lower level patches, not sure).

Thanks for reading,

[1] https://www.postgresql.org/message-id/flat/CAEepm=2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ(at)mail(dot)gmail(dot)com
[2] https://www.postgresql.org/message-id/flat/CAFiTN-sYQ8r8ANjWFYkXVfNxgXyLRfvbX9Ee4SxO9ns-OBBgVA(at)mail(dot)gmail(dot)com

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
undo-smgr-v1.tgz application/x-gzip 111.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2018-11-01 03:35:22 Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)
Previous Message Tom Lane 2018-11-01 03:17:35 Re: PG vs macOS Mojave