Re: Hard limit on WAL space used (because PANIC sucks)

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)heroku(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Hard limit on WAL space used (because PANIC sucks)
Date: 2014-01-21 18:41:50
Message-ID: CA+U5nM+ipFyK_cNkP=NdF72mTpMzyv=hsFaN-Si1Wx=85PmP+A@mail.gmail.com
Lists: pgsql-hackers

On 21 January 2014 18:35, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On 6 June 2013 16:00, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
>>> The current situation is that if you run out of disk space while writing
>>> WAL, you get a PANIC, and the server shuts down. That's awful.
>
>> I don't see we need to prevent WAL insertions when the disk fills. We
>> still have the whole of wal_buffers to use up. When that is full, we
>> will prevent further WAL insertions because we will be holding the
>> WALWriteLock to clear more space. So the rest of the system will lock
>> up nicely, like we want, apart from read-only transactions.
>
> I'm not sure that "all writing transactions lock up hard" is really so
> much better than the current behavior.

Lock up momentarily, until the situation clears. But my proposal would
allow the situation to fully clear, i.e. all WAL files could be
deleted as soon as replication/archiving has caught up. The current
behaviour doesn't automatically correct itself as this proposal would.
My proposal is also fully safe, in line with synchronous replication,
and adds zero performance overhead to mainline processing.

> My preference would be that we simply start failing writes with ERRORs
> rather than PANICs.

Yes, that is what I am proposing, amongst other points.

> I'm not real sure ATM why this has to be a PANIC
> condition. Probably the cause is that it's being done inside a critical
> section, but could we move that?

Yes, I think so.
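
To illustrate the critical-section point, here is a minimal standalone
sketch in C. It is a toy model, not PostgreSQL source: the names
crit_section_count, free_bytes, report_error, insert_check_inside and
insert_check_before are all hypothetical. The only behaviour borrowed from
the server is that an error raised while a critical section is open gets
promoted to PANIC. The sketch assumes the space check can be done before
the critical section is entered, which is the thing we would need to verify.

```c
#include <assert.h>

/* Severity of a reported failure in this toy model. */
typedef enum { LVL_OK, LVL_ERROR, LVL_PANIC } Level;

/* Models the server's critical-section nesting counter (hypothetical name). */
static int crit_section_count = 0;

/* Hypothetical amount of WAL space still available, in bytes. */
static long free_bytes = 0;

static void enter_crit_section(void)
{
    crit_section_count++;
}

static void exit_crit_section(void)
{
    assert(crit_section_count > 0);
    crit_section_count--;
}

/* An error raised inside a critical section is promoted to PANIC. */
static Level report_error(void)
{
    return crit_section_count > 0 ? LVL_PANIC : LVL_ERROR;
}

/* Out-of-space is discovered inside the critical section: PANIC. */
static Level insert_check_inside(long need)
{
    Level result = LVL_OK;

    enter_crit_section();
    if (need > free_bytes)
        result = report_error();
    else
        free_bytes -= need;
    exit_crit_section();
    return result;
}

/* Space is checked before the critical section: a plain ERROR. */
static Level insert_check_before(long need)
{
    if (need > free_bytes)
        return report_error();  /* no critical section is open here */

    enter_crit_section();
    free_bytes -= need;         /* cannot fail once the check has passed */
    exit_crit_section();
    return LVL_OK;
}
```

With free_bytes set to 16, insert_check_inside(32) reports LVL_PANIC while
insert_check_before(32) reports LVL_ERROR and leaves free_bytes untouched,
so only the one transaction fails and the server stays up.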

>> Instead of PANICing, we should simply signal the checkpointer to
>> perform a shutdown checkpoint.
>
> And if that fails for lack of disk space?

I proposed a way to ensure it wouldn't fail for that, at least on pg_xlog space.

> In any case, what you're
> proposing sounds like a lot of new complication in a code path that
> is necessarily never going to be terribly well tested.

It's the smallest amount of change proposed so far... I agree on the
danger of untested code.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
