Re: SR fails to send existing WAL file after off-line copy

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Matt Chesler <matt(at)pragmatrading(dot)com>
Subject: Re: SR fails to send existing WAL file after off-line copy
Date: 2010-11-01 03:21:21
Message-ID: AANLkTin1=k=OYrCfeMqrJuXa_+0312SWuoEqaF1adiDp@mail.gmail.com
Lists: pgsql-hackers

On Sun, Oct 31, 2010 at 5:31 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Which is confusing because that file is certainly on the master still, and
> hasn't even been considered archived yet much less removed:
>
> [master(at)pyramid pg_log]$ ls -l $PGDATA/pg_xlog
> -rw------- 1 master master 16777216 Oct 31 16:29 000000010000000000000000
> drwx------ 2 master master     4096 Oct  4 12:28 archive_status
> [master(at)pyramid pg_log]$ ls -l $PGDATA/pg_xlog/archive_status/
> total 0
>
> So why isn't SR handing that data over?  Is there some weird unhandled
> corner case this exposes, but that wasn't encountered by the systems the
> tutorial was tried out on?  I'm not familiar enough with the SR internals to
> reason out what's going wrong myself yet.  Wanted to validate that Matt's
> report wasn't a unique one though, with a bit more detail included about the
> state the system gets into, and one potential fix (increasing
> wal_keep_segments) already tried without improvement.

There seem to be two cases in the code that can generate that error.
One, attempting to open the file returns ENOENT. Two, after the data
has been read, the last-removed position returned by
XLogGetLastRemoved is at or past the data we think we just read, implying
that it was overwritten while we were in the process of reading it.
Does your installation have debugging symbols? Can you figure out
which case is triggering (inside XLogRead) and what the values of log,
seg, lastRemovedLog, and lastRemovedSeg are?
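
To make the two cases concrete, here is a rough sketch of the decision, not the actual walsender source. The enum and the function names are hypothetical, and the inclusive comparison in the second check is an assumption about how the last-removed test is written; only log, seg, lastRemovedLog, and lastRemovedSeg come from the description above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the outcomes; these are not PostgreSQL identifiers. */
typedef enum
{
    WAL_READ_OK,
    WAL_REMOVED_ENOENT,     /* case one: opening the segment file gave ENOENT */
    WAL_REMOVED_AFTER_READ  /* case two: segment counts as removed after the read */
} WalReadOutcome;

/*
 * Assumed form of the second check: the segment we read is at or before
 * the last-removed segment, so it may have been recycled under us.
 */
static bool
segment_already_removed(uint32_t log, uint32_t seg,
                        uint32_t lastRemovedLog, uint32_t lastRemovedSeg)
{
    return log < lastRemovedLog ||
           (log == lastRemovedLog && seg <= lastRemovedSeg);
}

/*
 * open_errno is the errno seen when opening the segment file, or 0 if the
 * open succeeded.  The last-removed values are only consulted in case two,
 * which is why a breakpoint on XLogGetLastRemoved can tell the cases apart.
 */
static WalReadOutcome
classify_wal_read(int open_errno,
                  uint32_t log, uint32_t seg,
                  uint32_t lastRemovedLog, uint32_t lastRemovedSeg)
{
    if (open_errno == ENOENT)
        return WAL_REMOVED_ENOENT;
    if (segment_already_removed(log, seg, lastRemovedLog, lastRemovedSeg))
        return WAL_REMOVED_AFTER_READ;
    return WAL_READ_OK;
}
```

Plugging the four values observed in the debugger into classify_wal_read, with open_errno set to 0, shows whether the second case would fire under this assumed comparison.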

Even if you lack debugging symbols, if you have gdb, you might be able
to figure out which case is triggering by looking at whether
XLogGetLastRemoved gets called before the error message is printed
(put a breakpoint on that function).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
