warm standby, pg_standby, invalid checkpoint record

From: Brad Wiemerslage <wiemersl(at)yahoo(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: warm standby, pg_standby, invalid checkpoint record
Date: 2009-02-27 07:47:44
Message-ID: 818050.20576.qm@web56001.mail.re3.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin


I'm attempting to get warm standby up and running with a pair of servers running ubuntu 8.04 and postgresql 8.3. Been following the docs:

http://www.postgresql.org/docs/8.3/static/warm-standby.html
http://www.postgresql.org/docs/current/static/pgstandby.html

Also, basically following the ideas here in this blog post:

http://scale-out-blog.blogspot.com/2009/02/simple-ha-with-postgresql-point-in-time.html

I've customized the original script he refers to in the article, which is here in its entirety for reference:

https://s3.amazonaws.com/extras.continuent.com/standby.sh

Here is the meat of my customized script, which runs on the standby. The postgresql server on the standby is stopped first.

start_backup="SELECT pg_start_backup('my_backup');"
stop_backup="SELECT pg_stop_backup();"
echo "$start_backup" | $psql -h$PRIMARY -U myuser -d mydb -e
rsync --delete -avz -e "ssh -i /path/to/key" myuser(at)$PRIMARY:$PG_DATA/ $PG_DATA
echo "$stop_backup" | $psql -h $PRIMARY -U myuser -d mydb -e

The files seem to copied over to the standby machine just fine. Success is reported with respect to the backup commands. Permissions seem fine.

Next, there are some steps which blow out some files. As I understand it, you no longer need the files on the standby that were in pg_xlog on the primary.

rm -f $PG_DATA/recovery.*
rm -f $PG_DATA/8.3/main/logfile
rm -f $PG_DATA/8.3/main/postmaster.pid
rm -f $PG_DATA/8.3/main/pg_xlog/0*
rm -f $PG_DATA/8.3/main/pg_xlog/archive_status/0*

This step seems to work fine.

Then, the archives are pulled. They are pulled to /mnt/postgresql_archives with this command:

rsync --delete -avz -e "ssh -i /path/to/key" myuser(at)$PRIMARY:$PG_ARCHIVES/ $PG_ARCHIVES

Everything looks good. I end up with an up to date list of WAL files in /mnt/postgresql_archives on the standby. Here is a listing:

root(at)standby:/mnt/postgresql_archives# ls
total 688996
drwxr-xr-x 2 postgres postgres 4096 2009-02-27 01:13 .
drwxr-xr-x 14 root root 4096 2009-02-27 01:17 ..
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 00:19 0000000100000000000000CB
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 00:29 0000000100000000000000CC
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 00:38 0000000100000000000000CD
-rw-rw---- 1 postgres postgres 245 2009-02-27 00:38 0000000100000000000000CD.00000020.backup
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 00:48 0000000100000000000000CE
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 00:54 0000000100000000000000CF
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 00:58 0000000100000000000000D0
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 01:01 0000000100000000000000D1
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 01:03 0000000100000000000000D2
-rw-rw---- 1 postgres postgres 16777216 2009-02-27 01:13 0000000100000000000000D3

Then, the recovery.conf is put in place. I've tried two different versions, which end up giving me the same error. Here are the two different versions.

#1: restore_command = '/usr/lib/postgresql/8.3/bin/pg_standby -c -d -s 2 -t /mnt/postgresql_archives/pgsql.trigger /mnt/postgresql_archives %f %p >> /mnt/postgresql_archives/standby.log 1>&2'

#2: restore_command = 'cp /mnt/server/archivedir/%f "%p"'

I don't believe that #2 is suitable for warm standby, but just tried it to debug after #1 wouldn't work. Now, I try to start up the server. For it to work in standby mode, additional archive files will be pulled from the primary machine on a periodic basis. I'm using this command, which deletes them on the primary when they are no longer necessary. It also seems to work fine.

rsync -avz -e "ssh -i /path/to/key" myuser(at)$PRIMARY:$PG_ARCHIVES/ $PG_ARCHIVES

I guess I'm a little confused about exactly what is happening here when the server comes up, but here is the error message I'm getting. It seems to be looking for the files in pg_pxlog, which is cleared out. So, the error makes sense. But isn't it supposed to be looking in /mnt/postgresql_archives per the restore_command(s)? The files are available there.

2009-02-27 01:26:52.867 EST,,,7422,,49a787ac.1cfe,2,,2009-02-27 01:26:52 EST,,0,LOG,58P01,"could not open file ""pg_xlog/0000000100000000000000CD"" (log file 0, segment 20
5): No such file or directory",,,,,,,,
2009-02-27 01:26:52.867 EST,,,7422,,49a787ac.1cfe,3,,2009-02-27 01:26:52 EST,,0,LOG,00000,"invalid checkpoint record",,,,,,,,
2009-02-27 01:26:52.867 EST,,,7422,,49a787ac.1cfe,4,,2009-02-27 01:26:52 EST,,0,PANIC,XX000,"could not locate required checkpoint record",,"If you are not restoring from a
 backup, try removing the file ""/var/lib/postgresql/8.3/main/backup_label"".",,,,,,
2009-02-27 01:26:52.868 EST,,,7419,,49a787ab.1cfb,1,,2009-02-27 01:26:51 EST,,0,LOG,00000,"startup process (PID 7422) was terminated by signal 6: Aborted",,,,,,,,
2009-02-27 01:26:52.868 EST,,,7419,,49a787ab.1cfb,2,,2009-02-27 01:26:51 EST,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,

So, I tried copying the files in the /mnt/postgresql_archives over to pg_xlog. This seemed to work, and the updates were applied. Never at any point did I get a recovery.done file. Also, for whatever reason, I was never able to get any debug info from pg_standby in standby.log.

Anyhow, I've burned up a couple days trying to figure this out. Any help would be much appreciated.

Thanks,
Brad

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Lentes, Bernd 2009-02-27 09:49:41 Re: Question to transaction ID wraparound
Previous Message Tony Liao 2009-02-27 02:04:14 Re: index for inet(ip address or ip range)