Hard limit on WAL space used (because PANIC sucks)

From: Heikki Linnakangas <hlinnakangas@vmware.com>
To: PostgreSQL-development <pgsql-hackers@postgresql.org>, Peter Geoghegan <pg@heroku.com>, Robert Haas <robertmhaas@gmail.com>
Subject: Hard limit on WAL space used (because PANIC sucks)
Date: 2013-06-06 14:00:30
Message-ID: 51B095FE.6050106@vmware.com

In the "Redesigning checkpoint_segments" thread, many people opined that
there should be a hard limit on the amount of disk space used for WAL:
http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6S4-PPd+6jiy4OQ78ihUA@mail.gmail.com.
I'm starting a new thread on that, because that's mostly orthogonal to
redesigning checkpoint_segments.

The current situation is that if you run out of disk space while writing
WAL, you get a PANIC, and the server shuts down. That's awful. We can
try to avoid that by checkpointing early enough that old WAL segments
can be removed to make room for new ones before the space runs out, but
unless we somehow throttle or stop new WAL insertions, it's always going
to be possible to use up all the disk space. A typical scenario where
that happens is when archive_command fails for some reason; even a
checkpoint can't remove old, unarchived segments in that case. But it
can happen even without WAL archiving.

I've seen a case where it was even worse than a PANIC and shutdown.
pg_xlog was on a separate partition that had nothing else on it. The
partition filled up, and the system shut down with a PANIC. Because
there was no space left, it could not even write the checkpoint after
recovery, and thus refused to start up again. There was nothing else on
the partition that you could delete to make space. The only recourse
would've been to add more disk space to the partition (impossible), or
manually delete an old WAL file that was not needed to recover from the
latest checkpoint (scary). Fortunately this was a test system, so we
just deleted everything.

So we need to somehow stop new WAL insertions from happening, before
it's too late.

Peter Geoghegan suggested one method here:
http://www.postgresql.org/message-id/flat/CAM3SWZQcyNxvPaskr-pxm8DeqH7_qevW7uqbhPCsg1FpSxKpoQ@mail.gmail.com.
I don't think that exact proposal is going to work very well; throttling
WAL flushing by holding WALWriteLock in WAL writer can have knock-on
effects on the whole system, as Robert Haas mentioned. Also, it'd still
be possible to run out of space, just more difficult.

To make sure there is enough room for the checkpoint to finish, other
WAL insertions have to stop some time before you completely run out of
disk space. The question is how to do that.

A naive idea is to check whether there's enough preallocated WAL space
just before inserting the WAL record. However, it's too late to check
that in XLogInsert; once you get there, you're already holding exclusive
locks on data pages, and you are in a critical section, so you can't
back out. At that point, you have to write the WAL record quickly, or
the whole system will suffer. So we need to act earlier.

A more workable idea is to sprinkle checks in higher-level code, before
you hold any critical locks, to verify that there is enough preallocated
WAL. Like, at the beginning of heap_insert, heap_update, etc., and all
similar indexam entry points. I propose that we maintain a WAL
reservation system in shared memory. First of all, keep track of how
much preallocated WAL there is left (and try to create more if needed).
Also keep track of a different number: the amount of WAL pre-reserved
for future insertions. Before entering the critical section, increase
the reserved number by a conservative (i.e. high enough) estimate of how
much WAL space you need, and check that there is still enough
preallocated WAL to satisfy all the reservations. If not, throw an error
or sleep until there is. After you're done with the insertion, release
the reservation by decreasing the number again.
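In rough C, the bookkeeping could look something like this (all the
names below are placeholders I'm making up for illustration, not a
worked-out design):

    /* Hypothetical shared-memory state for the reservation system */
    typedef struct
    {
        slock_t     mutex;
        uint64      preallocatedBytes;  /* preallocated WAL space left */
        uint64      reservedBytes;      /* sum of outstanding reservations */
    } XLogReserveShmem;

    static XLogReserveShmem *XLogReserve;

    /*
     * Call before entering the critical section, with a conservative
     * (high) estimate of the WAL space the operation might need.
     */
    void
    XLogReserveSpace(uint64 estimate)
    {
        SpinLockAcquire(&XLogReserve->mutex);
        if (XLogReserve->reservedBytes + estimate > XLogReserve->preallocatedBytes)
        {
            SpinLockRelease(&XLogReserve->mutex);
            /* ... or sleep here until a checkpoint frees up segments */
            ereport(ERROR,
                    (errcode(ERRCODE_DISK_FULL),
                     errmsg("could not reserve WAL space")));
        }
        XLogReserve->reservedBytes += estimate;
        SpinLockRelease(&XLogReserve->mutex);
    }

    /* Call after the WAL record has been inserted */
    void
    XLogReleaseReservedSpace(uint64 estimate)
    {
        SpinLockAcquire(&XLogReserve->mutex);
        XLogReserve->reservedBytes -= estimate;
        SpinLockRelease(&XLogReserve->mutex);
    }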

A shared reservation counter like that could become a point of
contention. One optimization is to keep a constant reservation of, say,
32 KB for each backend. That's enough for most operations. Change the
logic so that you check whether you've exceeded the reserved amount of
space *after* writing the WAL record, while you're holding WALInsertLock
anyway. If you do go over the limit, set a flag in backend-private
memory, so that the *next* time you're about to enter a critical section
that will write a WAL record, you first check whether more space has
been made available.
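The fast path then becomes just a test of a backend-local flag,
something like this (again with made-up names):

    /*
     * Set while holding WALInsertLock, if the records we inserted took
     * us past our standing 32 KB reservation.
     */
    static bool recheckWALSpace = false;

    /* before starting a WAL-logged operation, outside any critical section */
    if (recheckWALSpace)
    {
        /* sleep, or throw an ERROR, until the reservation is covered again */
        XLogWaitForReservedSpace();
        recheckWALSpace = false;
    }

    /* ... START_CRIT_SECTION(), modify pages, XLogInsert(), ... */

In the common case where WAL space is plentiful, the only overhead is
the flag test.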

- Heikki
