SR fails to send existing WAL file after off-line copy

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Matt Chesler <matt(at)pragmatrading(dot)com>
Subject: SR fails to send existing WAL file after off-line copy
Date: 2010-10-31 21:31:44
Message-ID: 4CCDE040.6030409@2ndquadrant.com
Lists: pgsql-hackers
Last week we got this report from Matt Chesler:
http://archives.postgresql.org/pgsql-admin/2010-10/msg00221.php
He was getting errors when trying to do a simple binary replication test.
The problem is that what appears to be a perfectly good WAL segment
doesn't get streamed to the standby.  No one responded.

One of our testers just ran into the same thing.  I just investigated, 
and I'm baffled as to what's going on myself.  Can't tell if this is a 
bug or an under-documented restriction, but this makes two reports of 
the problem now.  (Mine is happening on a standard 9.0.0 RPM set.  I didn't
notice any changes in 9.0.1 that would impact this, and I'm afraid to
upgrade while I have a repeatable test case for this sitting here.)

The goal is to get a simple test replication setup going without even
having to do the whole pg_start_backup shuffle, by copying the whole
server directory while the server is down.  Basic steps are:

-Follow the first set of instructions at 
http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial to set up a 
master compatible with replication, then stop it and duplicate it with 
rsync.  Note that you may have to manually create an empty pg_xlog 
directory on the standby, depending on what was there before you started 
replication. 

To rule out one possible source of problems here, I made an additional 
change on the master not listed there:

[master(at)pyramid pg_log]$ psql -d postgres -c "show wal_keep_segments"
 wal_keep_segments
-------------------
 10

I wondered if having this set to 0 (the default) was causing the issue, 
thinking perhaps it doesn't look for existing segments at all in that 
situation.  The problem still happens for me with it set to 10.

-Create a recovery.conf pointing to the master as described in the tutorial.

-Start the standby.  Make sure that it has reached the point where it's 
requesting WAL segments from the master; you want to see it looping 
doing periodic "FATAL:  could not connect to the primary server: could 
not connect to server: Connection refused" before touching the master.

-Start the master
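
For anyone following along, the recovery.conf from the second step boils down to two settings, standby_mode and primary_conninfo.  A minimal Python sketch that renders them is below.  render_recovery_conf is a made-up helper name, the host and user are the ones that show up in the master's log, and port 5432 is an assumption (the stock default), not something taken from this report.

```python
def render_recovery_conf(host, user, port=5432):
    """Return recovery.conf text for a 9.0 streaming-replication standby.

    port=5432 is an assumed default; adjust it to wherever the master listens.
    """
    conninfo = "host={} port={} user={}".format(host, port, user)
    # standby_mode keeps the standby retrying instead of ending recovery;
    # primary_conninfo tells it where to stream WAL from.
    return "standby_mode = 'on'\nprimary_conninfo = '{}'\n".format(conninfo)


if __name__ == "__main__":
    # Write the result to $PGDATA/recovery.conf on the standby before starting it.
    print(render_recovery_conf("127.0.0.1", "rep"), end="")
```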

What I expect to happen now is that the current WAL file that was in 
progress at the point the data was copied over gets streamed over.  That 
doesn't seem to happen.  Instead, I see this on the standby:

LOG:  streaming replication successfully connected to primary
FATAL:  could not receive data from WAL stream: FATAL:  requested WAL 
segment 000000010000000000000000 has already been removed

This on the master:

LOG:  replication connection authorized: user=rep host=127.0.0.1 port=52571
FATAL:  requested WAL segment 000000010000000000000000 has already been 
removed

Which is confusing because that file is certainly still on the master, 
and hasn't even been considered archived yet, much less removed:

[master(at)pyramid pg_log]$ ls -l $PGDATA/pg_xlog
-rw------- 1 master master 16777216 Oct 31 16:29 000000010000000000000000
drwx------ 2 master master     4096 Oct  4 12:28 archive_status
[master(at)pyramid pg_log]$ ls -l $PGDATA/pg_xlog/archive_status/
total 0
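
To be explicit about which segment is being asked for: a WAL file name is 24 hex digits, split into three 8-digit fields for the timeline ID, the log file number, and the segment number within that log.  The sketch below decodes the name from the error message; parse_wal_segment_name is a hypothetical helper written for this message, not anything shipped with PostgreSQL.

```python
import string


def parse_wal_segment_name(name):
    """Split a 24-hex-digit WAL segment file name into (timeline, log, segment)."""
    if len(name) != 24 or not all(c in string.hexdigits for c in name):
        raise ValueError("not a WAL segment file name: {!r}".format(name))
    timeline = int(name[0:8], 16)    # first 8 digits: timeline ID
    log_id = int(name[8:16], 16)     # middle 8 digits: log file number
    segment = int(name[16:24], 16)   # last 8 digits: segment within that log
    return timeline, log_id, segment


if __name__ == "__main__":
    print(parse_wal_segment_name("000000010000000000000000"))  # (1, 0, 0)
```

So the standby is requesting timeline 1, log 0, segment 0, which is the very first segment and the only file in the listing above.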

So why isn't SR handing that data over?  Is there some weird unhandled 
corner case this exposes, but that wasn't encountered by the systems the 
tutorial was tried out on?  I'm not familiar enough with the SR 
internals to reason out what's going wrong myself yet.  Wanted to 
validate that Matt's report wasn't a unique one though, with a bit more 
detail included about the state the system gets into, and one potential 
fix (increasing wal_keep_segments) already tried without improvement.

-- 
Greg Smith   2ndQuadrant US    greg(at)2ndQuadrant(dot)com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


