Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: David Powers <dpowers(at)janestreet(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-23 16:30:41
Message-ID: CAJpcCMhzNgWLrYPS1pMHGxrsSKOwBPvoHE-ozWSj=jpbfTCdfg@mail.gmail.com
Lists: pgsql-hackers

Thanks for the response.

I have some evidence against an issue in the backup procedure (though I'm
not ruling it out). We moved back to taking the backup off of the primary
and all errors for all three clusters went away. All of the hardware is
the same, OS and postgres versions are largely the same (9.2.3 vs. 9.2.4 in
some cases, various patch levels of CentOS 6.3 for the OS). The backup code
is exactly the same, just pointed at a different set of boxes.

Currently I'm just running for a couple of days to ensure that we have
viable static backups. After that I'll redo one of the restores from a
suspect backup and will post the logs.

-David

On Thu, May 23, 2013 at 11:26 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
> <bgrundmann(at)janestreet(dot)com> wrote:
> > We are seeing these errors on a regular basis on the testing box now. We
> > have even changed the backup script to shut down the hot standby, take an
> > lvm snapshot, restart the hot standby, and rsync the lvm snapshot. It
> > still happens.
> >
> > We never saw this before we introduced the hot standby. So we will
> > now revert to taking the backups from lvm snapshots on the production
> > database. If you have ideas of what else we should try / what information
> > we can give you to debug this let us know and we will try to do so.
> >
> > Until then we will sadly operate on the assumption that the combination of
> > hot standby and "frozen snapshot" backup of it is not production ready.
>
> I'm pretty suspicious that your backup procedure is messed up in some
> way. The fact that you got invalid page headers is really difficult
> to attribute to a PostgreSQL bug. A number of the other messages that
> you have posted also tend to indicate either corruption, or that WAL
> replay has stopped early. It would be interesting to see the logs
> from when the clone was first started up, juxtaposed against the later
> WAL flush error messages.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
