Re: restartpoints stop generating on streaming replication slave

From: Mathieu Fenniak <mathieu(dot)fenniak(at)replicon(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: restartpoints stop generating on streaming replication slave
Date: 2012-08-22 15:48:49
Message-ID: CAHoiPjz+qhWOES+q+LZ01qhWG8uYc5-xdKChky1-dcwMzyn4VA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Fujii,

Thanks for the quick reply.

We tried setting the log_min_messages using set_config() to debug2, but
this doesn't seem to take affect on the bgwriter process; if we changed
this in postgresql.conf, we'd have to run with the verbose logging for days
or weeks before the restartpoints stop. The pg_controldata definitely
shows that the master is checkpointing normally, but the slave's last
checkpoint corresponds to the last restartpoint logged. (the output of the
pg_controldata is in this e-mail from my colleague:
http://archives.postgresql.org/pgsql-general/2012-08/msg00335.php)

We have a theory as to the cause of this problem, today. We've been
looking into data corruption issues on our slave at the same time as we've
been looking into this problem. It turns out that our automated slave
startup has been deleting the "backup_label" file while syncing from the
master, causing the slave to miss replaying WAL data, and causing
random-appearing corruption errors on queries from the slave. At this
point we're theorizing that any issues we have on the slave are probably
issues caused by the corrupt backup restoration; we're going to fix the
corruption issue and then see if this problem ever re-occurs.

Mathieu

On Wed, Aug 22, 2012 at 8:10 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Wed, Aug 22, 2012 at 5:52 AM, Mathieu Fenniak
> <mathieu(dot)fenniak(at)replicon(dot)com> wrote:
> > Hi all,
> >
> > I've been investigating an issue with our PostgreSQL 9.1.1 (Linux x86-64
> > CentOS 5.8) database where restartpoints suddenly stop being generated on
> > the slave after working correctly for a week or two. The symptom of the
> > problem is that the pg_xlog directory on the slave doesn't get cleaned
> up,
> > and the log_checkpoints output (eg. restartpoint starting: time) stops
> > appearing.
> >
> > I was able to extract a core dump of the bgwriter process while it was in
> > BgWriterNap. I inspected ckpt_start_time and last_checkpoint_time;
> > ckpt_start_time was 1345578533 (... 19:48:53 GMT) and
> last_checkpoint_time
> > was 1345578248 (... 19:44:08 GMT). Based upon these values, I concluded
> > that it's performing checkpoints but missing the "if (ckpt_performed)"
> > condition (ie. CreateRestartPoint returns false); it's then setting
> > last_checkpoint_time to now - 5 minutes + 15 seconds.
> >
> > There seems to be two causes of a false retval in CreateRestartPoint; the
> > first is if !RecoveryInProgress(), and the second is if "the last
> checkpoint
> > record we've replayed is already our last restartpoint". The first
> > condition doesn't seem likely; does anyone know how we might be hitting
> the
> > second condition? We have continuous traffic on the master server in the
> > range of 1000 txn/s, and the slave seems to be completely up-to-date, so
> I
> > don't understand how we could be hitting this condition.
>
> To check whether you really hit either of the above two conditions, could
> you
> set log_min_messages to DEBUG2 on the standby? If you hit either, you'll
> get the log message like "skipping restartpoint......".
>
> Could you execute pg_controldata on both master and standby, and check
> whether their "Latest checkpoint location" are the same?
>
> Regards,
>
> --
> Fujii Masao
>

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-08-22 17:01:04 Isn't remote_write a really dumb name for that setting?
Previous Message Fujii Masao 2012-08-22 15:01:40 Re: Avoiding shutdown checkpoint at failover