question about warm standby databases in 8.2.5

From: "Brett Neumeier" <bneumeier(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: question about warm standby databases in 8.2.5
Date: 2007-12-11 04:43:11
Message-ID: 5f668d330712102043m17a391c7xeac6ba135ff673c2@mail.gmail.com
Lists: pgsql-general

Hi,

I set up a warm standby failover system on Red Hat, using PostgreSQL 8.2.5
built from source on (of course) both the master and standby systems.

The setup of the system was very easy, and the recovery script we have in
place on the standby system correctly copies in the archived WAL files,
which are then applied.
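
A recovery script of this kind usually waits for the next archived segment and then copies it to the path the server asks for. The actual recover_script.rb is not shown in this message, so what follows is only a minimal Python sketch of that pattern. The function name, the trigger-file convention, and every parameter are assumptions made for illustration.

```python
import os
import shutil
import time


def restore_wal(archive_dir, wal_name, dest_path, trigger_path,
                poll_seconds=1.0, max_polls=None):
    """Wait for wal_name in archive_dir, then copy it to dest_path.

    wal_name and dest_path stand in for the %f and %p arguments of
    restore_command. trigger_path is a hypothetical file whose
    appearance means "stop waiting and let the standby come online".
    Returns 0 after a successful copy, 1 otherwise, mirroring the
    zero-on-success exit status a restore_command is expected to use.
    """
    source = os.path.join(archive_dir, wal_name)
    polls = 0
    while not os.path.exists(source):
        if os.path.exists(trigger_path):
            return 1  # triggered before the segment arrived
        if max_polls is not None and polls >= max_polls:
            return 1  # optional safety limit, mainly for testing
        time.sleep(poll_seconds)
        polls += 1
    shutil.copy(source, dest_path)
    return 0
```

A real script would likely need extra care this sketch leaves out, for example not waiting indefinitely for files that may never be archived (such as timeline history files) and not copying a segment that is still being written into the archive.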

What seems odd is what happens when we abort the continuous recovery so the
standby database becomes primary.

It seems that the recovery command always copies the source WAL file (with a
name like 00000001000000020000009C) to a file path "pg_xlog/RECOVERYXLOG",
which is fine. However, when we then abort recovery, PostgreSQL seems to
expect the most recent WAL file to be in pg_xlog under its original
filename, e.g., the 0...9C filename from above.

This seems broken -- if the WAL file should wind up in the pg_xlog directory
with the 0...9C name, why isn't PostgreSQL copying it there?

The log messages below show what I'm talking about. Note that
everything is fine for quite a while; then we triggered the standby database
to come online before 0...B4 was archived, and PostgreSQL then bails out
because 0...B3 (which has already been restored) doesn't exist!

We're working around this, for now, by having the recovery command script
copy archived WAL files to the specified location pg_xlog/RECOVERYXLOG, and
also to the pg_xlog directory with the file's original basename. But that
seems awfully sloppy, and isn't the process documented in the manual.
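
The workaround described above can be sketched roughly as follows. This is a Python illustration, not the actual recover_script.rb; the function name and the explicit xlog_dir parameter are assumptions, and the archive is assumed to be a plain directory.

```python
import os
import shutil


def restore_with_duplicate(archive_dir, wal_name, dest_path, xlog_dir):
    """Copy an archived WAL file to two places.

    dest_path is the location the server requested (the %p argument,
    e.g. pg_xlog/RECOVERYXLOG). The second copy goes into xlog_dir
    under the file's original basename, which is the workaround part.
    Returns 0 on success and 1 if the file is not in the archive.
    """
    source = os.path.join(archive_dir, wal_name)
    if not os.path.isfile(source):
        return 1
    shutil.copy(source, dest_path)
    shutil.copy(source, os.path.join(xlog_dir, wal_name))
    return 0
```

Each restored segment is written twice, so the sketch doubles the disk I/O per segment and leaves extra files in pg_xlog.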

Advice is eagerly solicited!

LOG: starting archive recovery
LOG: restore_command = "/home/pgsql/bin/recover_script.rb %f %p"
LOG: restored log file "0000000100000002000000A1.001FAD68.backup" from archive
LOG: restored log file "0000000100000002000000A1" from archive
LOG: checkpoint record is at 2/A11FAD68
LOG: redo record is at 2/A11FAD68; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 0/82464990; next OID: 45282
LOG: next MultiXactId: 28; next MultiXactOffset: 55
LOG: automatic recovery in progress
LOG: redo starts at 2/A11FADB0
LOG: restored log file "0000000100000002000000A2" from archive
[a bunch of similar messages omitted]
LOG: restored log file "0000000100000002000000B3" from archive
LOG: could not open file "pg_xlog/0000000100000002000000B4" (log file 2, segment 180): No such file or directory
LOG: redo done at 2/B354BDD0
PANIC: could not open file "pg_xlog/0000000100000002000000B3" (log file 2, segment 179): No such file or directory
LOG: startup process (PID 17604) was terminated by signal 6
LOG: aborting startup due to startup process failure
LOG: database system was interrupted while in recovery at log time 2007-12-10 16:57:42 EST
HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
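
The file names and positions in the log above are related by simple hex arithmetic: a segment name is three 8-digit hex fields (timeline, log file, segment), and a position such as 2/B354BDD0 falls in the segment given by dividing its offset by the segment size. The sketch below assumes the default 16 MB segment size; the helper names are made up for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default; a custom build may differ


def parse_wal_name(name):
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, log, segment


def wal_name_for_location(timeline, location, segment_size=WAL_SEGMENT_SIZE):
    """Return the WAL file name containing a position like '2/B354BDD0'."""
    log_hex, offset_hex = location.split("/")
    log = int(log_hex, 16)
    segment = int(offset_hex, 16) // segment_size
    return "%08X%08X%08X" % (timeline, log, segment)


print(parse_wal_name("0000000100000002000000B4"))   # (1, 2, 180)
print(wal_name_for_location(1, "2/B354BDD0"))       # 0000000100000002000000B3
```

Under that assumption, "redo done at 2/B354BDD0" lies inside segment 0...B3, which matches the file named in the PANIC message.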

Cheers,

bn

--
Brett Neumeier (bneumeier(at)gmail(dot)com)
