Re: warning message in standby

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: warning message in standby
Date: 2010-06-29 13:33:20
Message-ID: AANLkTimzwrEKk7HfREaoGw6TjrzOOiG7cDn80W-aNwPp@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 29, 2010 at 6:59 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jun 29, 2010 at 3:55 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Tue, Jun 15, 2010 at 11:35 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> On the other hand, I like immediate-panicking. And I don't want the standby
>>> to retry reconnecting the master infinitely.
>>
>> On second thought, the peremptory PANIC is not good for HA system. If the
>> master unfortunately has written an invalid record because of its crash,
>> the standby would exit with PANIC before performing a failover.
>
> I don't think that should ever happen.  The master only streams WAL
> that it has fsync'd.  Presumably there's no reason for the master to
> ever fsync a partial WAL record (which is usually how a corrupt record
> gets into the stream).
>
>> So when an invalid record is found in streamed WAL file, we should keep
>> the standby running and leave the decision whether the standby retries to
>> connect to the master forever or shuts down right now, up to the user
>> (actually, it may be a clusterware)?
>
> Well, if we want to leave it up to the user/clusterware, the current
> code is possibly adequate, although there are many different log
> messages that could signal this situation, so coding it up might not
> be too trivial.

So here's a patch that seems to implement the behavior I'm thinking of:
if we repeatedly retrieve the same WAL record from the master and
never succeed in replaying it, then give up.
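
To make the idea concrete, here is a minimal sketch of the "bound the retries on the same record" logic. It is not the code in the attached patch. The names (RetryTracker, InvalidWalRecord, format_lsn), the retry limit of 3, and the representation of a WAL location as a single 64-bit integer are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical limit; the thread does not say how many retries the patch allows.
MAX_RETRIES = 3


class InvalidWalRecord(Exception):
    """Raised once the same WAL location has failed to replay too many times."""


def format_lsn(lsn: int) -> str:
    # Render a WAL location in the "hi/lo" hex style seen in the server log.
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


@dataclass
class RetryTracker:
    max_retries: int = MAX_RETRIES
    last_failed_lsn: Optional[int] = None
    failures: int = 0

    def record_failure(self, lsn: int) -> None:
        """Note a failed replay attempt at lsn; give up if it keeps failing."""
        if lsn == self.last_failed_lsn:
            self.failures += 1
        else:
            # A failure at a different location starts a fresh count.
            self.last_failed_lsn = lsn
            self.failures = 1
        if self.failures >= self.max_retries:
            raise InvalidWalRecord(
                f"invalid record in WAL stream at {format_lsn(lsn)}"
            )

    def record_success(self) -> None:
        """Any successfully replayed record clears the failure count."""
        self.last_failed_lsn = None
        self.failures = 0
```

In this sketch the caller would invoke record_failure() each time a re-fetched record fails validation and record_success() after any record replays cleanly, so only an unbroken run of failures at one location ends recovery.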

It seems we don't have 100% consensus on this, but I thought posting
the patch might inspire some further thoughts. I'm really
uncomfortable with the idea that if the slave gets out of sync with
the master we'll just do this forever:

FATAL: terminating walreceiver process due to administrator command
LOG: streaming replication successfully connected to primary
LOG: invalid record length at 0/313FB638
FATAL: terminating walreceiver process due to administrator command
LOG: streaming replication successfully connected to primary
LOG: invalid record length at 0/313FB638
FATAL: terminating walreceiver process due to administrator command
LOG: streaming replication successfully connected to primary
LOG: invalid record length at 0/313FB638
FATAL: terminating walreceiver process due to administrator command
LOG: streaming replication successfully connected to primary
LOG: invalid record length at 0/313FB638

...with this patch, following the above, you get:

FATAL: invalid record in WAL stream
HINT: Take a new base backup, or remove recovery.conf and restart in read-write mode.
LOG: startup process (PID 6126) exited with exit code 1
LOG: terminating any other active server processes

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment: bound-corrupt-record-retries.patch (application/octet-stream, 3.8 KB)