Re: WARNINGs after starting backup server created with PITR

From: Erik Jones <erik(at)myemma(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, General postgres mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: WARNINGs after starting backup server created with PITR
Date: 2008-01-19 19:53:05
Message-ID: 8175BC3D-D9FD-4DA5-87B7-CC9A3E5E0005@myemma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On Jan 19, 2008, at 1:24 PM, Simon Riggs wrote:

> So we definitely have missing data. I think the multi-phase rsync is
> definitely suspect and should be avoided until we get to the bottom of
> this.

By this, do you mean the way we ran the rsync multiple times before
performing the actual base backup? Easy enough.

> Can you examine the server logs for that 11 hour period and see if
> there
> are any other messages that might be relevant on both primary and
> standby servers?

On the primary, all relevant log entries are just entries for
successfully logged WALs.

> Can you save the WAL files covering that period also. We may want to
> inspect them later to confirm whether the data was actually in them or
> not. Can you save the very first WAL files after the recovery started?
> We can use those to examine the first data blocks touched, which would
> help confirm when the gun was fired.

Sure, I still have all of the WALs that were shipped from just before
the beginning of the base back up until about 8 or 9 hours after the
missing WAL file. I'm not sure if I'll be able to get to it today as
I've got lots of household chores to get to, but either tomorrow of
first thing Monday I'll zip them and post them here.

> Has rsync produced any messages of note? Might your script be ignoring
> errors? What versions of rsync are you using?

Version 2.6.9 on all three servers (1 primary, 2 standbys).

Here's the python script I was using to ship to both servers, right
now I'm back to a direct rsync call for my archive_command to sb1.
What's really weird, is that for the two WALs that disappeared, or
didn't make it, even though the primary's log said they were
successfully archived, they weren't on either server.

#####
#!/usr/bin/env python
from commands import getstatusoutput
import sys

source_path = sys.argv[1]
file_name = sys.argv[2]

# db2
res, out = getstatusoutput("ssh db2 'test -f /mnt/xfer/wal_archive/%
s'" % file_name)
if res != 0:
res, out = getstatusoutput("rsync -a %s db2:/mnt/xfer/wal_archive/"
% source_path)

if res != 0:
sys.exit(res)

# sb1
res, out = getstatusoutput("test -f /mnt/xfer/wal_archive/%s" %
file_name)
if res != 0:
res, out = getstatusoutput("rsync -a %s /mnt/xfer/wal_archive",
source_path)

sys.exit(res)
#####

> Is the History table Insert only? We might use that fact to examine
> the
> LSNs of the equivalent blocks on the Primary. If the LSNs are prior to
> the start of the recovery, as noted in the backup label file of the
> original base backup, then we can confirm rsync or postgres as the
> cause. Do you still have the base backup label file or base backup?

I still have the backup label but not the base backup. The history
tables are about 90-95% insert only, on average, but they do see the
occasional update if a message's delivery status is a bounce. If you
think this would still be an option, I'd be happy to follow whatever
instructions, or supply any files, needed.

Erik Jones

DBA | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David Fetter 2008-01-19 23:00:35 Re: Replication Using Triggers
Previous Message Simon Riggs 2008-01-19 19:25:01 Re: WARNINGs after starting backup server created with PITR