Re: Hard limit on WAL space used (because PANIC sucks)

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hard limit on WAL space used (because PANIC sucks)
Date: 2013-06-06 23:25:29
Message-ID: 51B11A69.4050909@agliodbs.com
Lists: pgsql-hackers

Let's talk failure cases.

There are actually three potential failure cases here:

- One Volume: WAL is on the same volume as PGDATA, and that volume is
completely out of space.

- XLog Partition: WAL is on its own partition/volume, and fills it up.

- Archiving: archiving is failing or too slow, causing the disk to fill
up with waiting log segments.

I'll argue that these three cases need to be dealt with in three
different ways, and no single solution is going to work for all three.

Archiving
---------

In some ways, this is the simplest case. Really, we just need a way to
know when the available WAL space has become 90% full, and abort
archiving at that stage. Once we stop attempting to archive, we can
clean up the unneeded log segments.

What we need is a better way for the DBA to find out that archiving is
falling behind as soon as it starts to do so. Tailing the log and
examining the rather cryptic error messages we give out isn't very
effective.
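
As a rough illustration of the kind of early warning meant here, the
sketch below counts the ".ready" marker files in the archive_status
subdirectory of the WAL directory; each one marks a segment that is
still waiting to be archived. The function names and the max_pending
threshold are assumptions made up for this example, not anything
PostgreSQL provides.

```python
import os

def count_pending_segments(archive_status_dir):
    """Count WAL segments still waiting to be archived.

    The archiver leaves a '<segment>.ready' marker file in
    archive_status for each segment it has not yet archived.
    """
    return sum(
        1 for name in os.listdir(archive_status_dir)
        if name.endswith(".ready")
    )

def archiving_is_behind(archive_status_dir, max_pending=10):
    # max_pending is a hypothetical alert threshold; tune it per site.
    return count_pending_segments(archive_status_dir) > max_pending
```

A monitoring job could call archiving_is_behind() on the
archive_status directory every minute or so and alert the DBA well
before the volume fills.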

XLog Partition
--------------

As Heikki pointed out, a full dedicated WAL drive is hard to fix once
it gets full, since there's nothing you can safely delete to clear
space, even enough for a checkpoint record.

On the other hand, it should be easy to prevent the drive from filling
up; we could simply force a non-spread checkpoint whenever the
available WAL space gets 90% full. We'd also probably want to be
prepared to switch to a read-only mode if we get full enough that
there's only room for the checkpoint records.
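
To make the two thresholds concrete, here is a minimal sketch of the
decision logic only. The action names, the reserve_bytes parameter
(space held back for checkpoint records) and the helper that reads
usage from the filesystem are all hypothetical; nothing like this
exists in the server today.

```python
import shutil

def wal_action(used_bytes, total_bytes, reserve_bytes, checkpoint_at=0.90):
    """Pick an action based on how full the WAL volume is.

    reserve_bytes is a hypothetical amount of space kept back so that
    checkpoint records can still be written.
    """
    free_bytes = total_bytes - used_bytes
    if free_bytes <= reserve_bytes:
        # Only the reserve is left: stop accepting writes.
        return "read_only"
    if used_bytes / total_bytes >= checkpoint_at:
        # Past the 90% mark: force an immediate (non-spread) checkpoint.
        return "force_checkpoint"
    return "ok"

def wal_action_for_path(wal_dir, reserve_bytes):
    # Read real usage figures for the volume that holds wal_dir.
    usage = shutil.disk_usage(wal_dir)
    return wal_action(usage.total - usage.free, usage.total, reserve_bytes)
```

For example, wal_action(90, 100, 5) returns "force_checkpoint" and
wal_action(96, 100, 5) returns "read_only".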

One Volume
----------

This is the most complicated case, because we wouldn't necessarily run
out of space because of WAL using it up. Anything could cause us to run
out of disk space, including activity logs, swapping, pgsql_tmp files,
database growth, or some other process which writes files.

This means that getting out of a disk-full condition manually is in
some ways easier for the DBA; there's usually stuff she can delete.
However, it's much harder -- maybe impossible -- for PostgreSQL to
prevent this kind of space outage. There should be things we can do to
make it easier for the DBA to troubleshoot this, but I'm not sure what.

We could use a hard limit for WAL to prevent WAL from contributing to
out-of-space, but that'll only prevent a minority of cases.
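
For illustration, a hard limit check could be as simple as summing the
files in the WAL directory and comparing against a cap. This is only a
sketch under assumed names: limit_bytes stands in for a setting that
does not exist, and a real implementation would live inside the
server rather than in an external script.

```python
import os

def wal_bytes_used(wal_dir):
    """Sum the sizes of regular files directly inside wal_dir.

    Subdirectories such as archive_status are skipped.
    """
    total = 0
    with os.scandir(wal_dir) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total

def over_wal_limit(wal_dir, limit_bytes):
    # limit_bytes is a hypothetical hard cap on WAL space.
    return wal_bytes_used(wal_dir) >= limit_bytes
```

Note that this says nothing about the other consumers of space listed
above, which is why it only covers a minority of cases.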

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
