Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: David Powers <dpowers(at)janestreet(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 12:42:07
Message-ID: CAJpcCMhxK56fyjj708Q2x-8F8Q2nacJ5gs9ALMFW13K9sqjeoQ@mail.gmail.com
Lists: pgsql-hackers

First, thanks for the replies. This sort of thing is frustrating and hard
to diagnose at a distance, and any help is appreciated.

Here is some more background:

We have three 9.2.4 databases, each using the following setup:

- A primary box
- A standby box running as a hot streaming replica from the primary
- A testing box restored nightly from a static backup

As noted, the static backup is taken off of the standby by taking an LVM
snapshot of the database filesystem and rsyncing it. I don't think it's
likely to be the problem, but the rsync uses the previous backup (via
--link-dest) to make the rsync faster and the resulting backup smaller.
Each database is ~1.5T, so this is necessary to keep static backup times
reasonable.
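
For readers following along, the snapshot-and-rsync procedure described
above could be sketched roughly as below. This is not the actual backup
script: the function name, volume group, logical volume, snapshot size,
mount point and backup paths are all hypothetical placeholders, and the
code only builds the command lines instead of running them.

```python
from typing import List, Optional


def snapshot_backup_commands(
    vg: str,
    lv: str,
    snap_name: str,
    snap_size: str,
    mount_point: str,
    dest: str,
    prev_backup: Optional[str] = None,
) -> List[List[str]]:
    """Build (but do not run) the commands for one snapshot-and-rsync backup.

    All arguments are placeholders chosen for illustration.
    """
    snap_dev = f"/dev/{vg}/{snap_name}"

    rsync = ["rsync", "-a"]
    if prev_backup is not None:
        # Files unchanged since the previous backup become hard links into
        # it, which keeps the copy fast and the new backup small.
        # An absolute path avoids rsync resolving it relative to dest.
        rsync.append(f"--link-dest={prev_backup}")
    rsync += [f"{mount_point}/", f"{dest}/"]

    return [
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap_name, f"/dev/{vg}/{lv}"],
        # XFS refuses to mount a filesystem whose UUID is already mounted
        # unless nouuid is given; the snapshot is only a copy source, so it
        # is mounted read-only.
        ["mount", "-o", "ro,nouuid", snap_dev, mount_point],
        rsync,
        ["umount", mount_point],
        ["lvremove", "-f", snap_dev],
    ]
```

Printing each returned list joined with spaces gives a dry run of the
sequence; nothing here touches LVM or the filesystem.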

We've been using the same system for quite some time, but until about a
month ago we took the backup off of the primary (still using the LVM
snapshot). The replication is a recent addition, and a very helpful one:
LVM snapshots aren't lightweight in the face of writes, and in some
circumstances a long-running rsync would spike the IO load on the
production box.

Results of some additional tests:

After the user noticed that the test restore showed the original problem,
we ran `vacuum analyze` on all three testing databases, thinking that it
had a good chance of quickly touching most of the underlying files. That
gave us errors on two of the testing restores similar to:

ERROR: invalid page header in block 5427 of relation base/16417/199732075

In the meantime I modified the static backup procedure to shut the standby
down completely before taking the LVM snapshot, and I am trying a restore
from that snapshot now. I'll check it with the same vacuum analyze test
and, if that passes, a full vacuum.
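
As a rough illustration of the modified procedure and the follow-up check,
the commands might look like the sketch below. The function names, data
directory, volume names and snapshot size are hypothetical, and the fast
shutdown mode is an assumption; the code only builds the command lines.

```python
from typing import List


def cold_snapshot_commands(
    pgdata: str, vg: str, lv: str, snap_name: str, snap_size: str
) -> List[List[str]]:
    """Commands for snapshotting with the standby stopped (built, not run)."""
    return [
        # Stop the standby and wait (-w) until shutdown has completed, so
        # the snapshot sees a cleanly shut down data directory.
        ["pg_ctl", "stop", "-D", pgdata, "-m", "fast", "-w"],
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap_name, f"/dev/{vg}/{lv}"],
        # The standby can be restarted as soon as the snapshot exists.
        ["pg_ctl", "start", "-D", pgdata, "-w"],
    ]


def restore_check_commands(full: bool = False) -> List[List[str]]:
    """Commands to vacuum and analyze every database in a restored cluster."""
    cmd = ["vacuumdb", "--all", "--analyze"]
    if full:
        # Second-stage check: rewrite every table with a full vacuum.
        cmd.append("--full")
    return [cmd]
```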

I'm also running the vacuum analyze on the production machines to double
check that the base databases don't have a subtle corruption that simply
hasn't been noticed. They run with normal autovacuum settings, so I
suspect they are fine and this won't show anything: the autovacuum daemon
should already have hit any such corruption.

I'm happy to share the scripts we use for the backup/restore process if the
information above isn't enough, as well as the logs - though the postgres
logs don't seem to contain much of interest (the database system doesn't
really get involved).

I also have the rsyncs of the failed snapshots available and could restore
them for testing purposes. It's also easy to look in them (they are just
saved as normal directories on a big SAN) if I know what to look for.
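
One thing that could be looked at directly in the saved directories is the
page named in the error above. The sketch below reads one block from a
relation file and applies sanity checks modeled on the server's page header
validation. The 24-byte little-endian header layout, the 8192-byte block
size, the 1 GB segment size and the layout version 4 are assumptions about
a default 9.2 build on x86 and should be checked against bufpage.h; the
function names are made up for illustration.

```python
import struct
from typing import Optional

BLCKSZ = 8192            # assumed default block size
SEGMENT_BLOCKS = 131072  # assumed blocks per 1 GB segment file
# Assumed header layout: pd_lsn (2 x uint32), pd_tli, pd_flags, pd_lower,
# pd_upper, pd_special, pd_pagesize_version (uint16 each), pd_prune_xid.
HEADER = struct.Struct("<IIHHHHHHI")


def read_block(relation_path: str, block: int) -> bytes:
    """Read one block, following the .1, .2, ... segment file naming."""
    segment, offset = divmod(block, SEGMENT_BLOCKS)
    path = relation_path if segment == 0 else f"{relation_path}.{segment}"
    with open(path, "rb") as f:
        f.seek(offset * BLCKSZ)
        return f.read(BLCKSZ)


def page_header_problem(page: bytes) -> Optional[str]:
    """Return None if the header looks sane, else a short reason."""
    if len(page) != BLCKSZ:
        return f"short read: {len(page)} bytes"
    if page == bytes(BLCKSZ):
        return None  # an all-zero page is treated as a valid empty page
    _, _, _, flags, lower, upper, special, size_version, _ = (
        HEADER.unpack_from(page)
    )
    if (size_version & 0xFF00) != BLCKSZ:
        return f"unexpected page size {size_version & 0xFF00}"
    if (size_version & 0x00FF) != 4:
        return f"unexpected layout version {size_version & 0x00FF}"
    if flags & ~0x0007:
        return f"unknown flag bits {flags:#06x}"
    if not (HEADER.size <= lower <= upper <= special <= BLCKSZ):
        return f"bad offsets lower={lower} upper={upper} special={special}"
    if special % 8 != 0:
        return f"special offset {special} is not 8-byte aligned"
    return None
```

For the error quoted earlier, this would be called as
`page_header_problem(read_block(root + "/base/16417/199732075", 5427))`,
where `root` is wherever the restored data directory lives in the saved
rsync. Comparing the result for the same block on the standby and in the
backup would show whether the page was already bad before the copy.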

-David

On Wed, May 15, 2013 at 2:24 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:

> On 14.05.2013 23:47, Benedikt Grundmann wrote:
>
>> The only thing that is *new* is that we took the snapshot from the
>> streaming replica. So again my best guess as of now is that if the
>> database crashes while it is in streaming standby, an invalid disk state
>> can result during the following startup (in rare and as of now unclear
>> circumstances).
>>
>
> A bug is certainly possible. There isn't much detail here to debug with,
> I'm afraid. Can you share the full logs on all three systems? I'm
> particularly interest
>
>
>> You seem to be quite convinced that it must be LVM; can you elaborate why?
>>
>
> Well, you said that there was a file in the original filesystem, but not
> in the snapshot. If you didn't do anything in between, then surely the
> snapshot is broken, if it skipped a file. Or was the file created in the
> original filesystem after the snapshot was taken? You probably left out
> some crucial details on how exactly the snapshot and rsync are performed.
> Can you share the scripts you're using?
>
> Can you reproduce this problem with a new snapshot? Do you still have the
> failed snapshot unchanged?
>
> - Heikki
>
