Re: BUG #6170: hot standby wedging on full-WAL disk

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6170: hot standby wedging on full-WAL disk
Date: 2011-08-25 17:49:46
Message-ID: CAAZKuFbAAmMkEjdHtGetHB4xNjSFx9v_UZZeDVpgROGCNMAO-w@mail.gmail.com
Lists: pgsql-bugs

On Thu, Aug 25, 2011 at 10:16 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 25.08.2011 19:11, Robert Haas wrote:
>>
>> On Mon, Aug 22, 2011 at 2:57 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> So the problem is that walreceiver merrily writes so much future WAL that
>>> it runs out of disk space? A limit on the maximum number of future WAL
>>> files to stream ahead would fix that, but I can't get very excited about
>>> it. Usually you do want to stream as much ahead as you can, to ensure
>>> that the WAL is safely on disk on the standby, in case the master dies.
>>> So the limit would need to be configurable.
>>
>> It seems like perhaps what we really need is a way to make replaying
>> WAL (and getting rid of now-unneeded segments) to take priority over
>> getting new ones.
>
> With the defaults we start to kill queries after a while that get in the way
> of WAL replay. Daniel had specifically disabled that. Of course, even with
> the query-killer disabled, it's possible for the WAL replay to fall so badly
> behind that you fill the disk, so a backstop might be useful anyway,
> although that seems a lot less likely in practice and if your standby can't
> keep up you're in trouble anyway.

I do think it's not a bad idea to have postgres prune unnecessary WAL,
at least enough that it can get the WAL segment it wants -- basically
unsticking the recovery command so progress can be made. Right now
someone (like me) has to go in and trim away what appears to be
unnecessary WAL by hand.
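
To make the manual trimming concrete, here is a minimal sketch of picking
out candidate segments by name. It only lists files and deletes nothing.
The function name removable_segments and the oldest_needed argument are
hypothetical, and working out which segment is really the oldest one
still needed is the hard part that this sketch does not attempt. It
assumes the usual 24-hex-digit segment names and, as I understand
pg_archivecleanup to do, ignores the timeline prefix when comparing.

```python
import re

# WAL segment file names are 24 hex digits: timeline, log, segment (8 each).
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def removable_segments(names, oldest_needed):
    """Return the segment names that sort strictly before oldest_needed.

    Hypothetical helper: the caller must already know which segment is the
    oldest one recovery still needs. Assumption: the 8-digit timeline prefix
    is ignored in the comparison.
    """
    if not WAL_NAME.match(oldest_needed):
        raise ValueError("not a WAL segment name: %r" % (oldest_needed,))
    cutoff = oldest_needed[8:]
    # Skip anything that is not a plain segment file (history files, etc.).
    return sorted(n for n in names if WAL_NAME.match(n) and n[8:] < cutoff)
```

One would feed it the contents of the pg_xlog directory (for example from
os.listdir) and review the result before removing anything.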

Also, I'm not sure if the segments that are downloaded via
restore_command during the fall-behind time are "counted" towards
replay when un-sticking after a restart of postgres: in particular, I
believe that PG will want to copy the segments a second time, although
I'm not 100% sure right now. Regardless, not being able to restart
properly or make progress after killing the offending backend are
unhappy things.

More thoughts?

--
fdr
