Re: Serious problem: media recovery fails after system or PostgreSQL crash

From: Daniel Farina <daniel(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Serious problem: media recovery fails after system or PostgreSQL crash
Date: 2012-12-07 00:16:54
Message-ID: CAAZKuFY2vYRb=-CzGKPKoqSApwyhWqzg+Hs30Sbk-ueq-+zieA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 6, 2012 at 9:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "MauMau" <maumau307(at)gmail(dot)com> writes:
>> I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem that
>> media recovery failed showing the following message:
>> FATAL: archive file "000000010000008000000028" has wrong size: 7340032
>> instead of 16777216
>
> Well, that's unfortunate, but it's not clear that automatic recovery is
> possible. The only way out of it would be if an undamaged copy of the
> segment was in pg_xlog/ ... but if I recall the logic correctly, we'd
> not even be trying to fetch from the archive if we had a local copy.

I'm inclined to agree with this: I've had a case much like the
original poster's (too-short WAL segments because of media issues),
except that in my case the archiver had also archived a bogus copy of
the data (short length and all), so our attempt to recover from
archives on a brand new system was met with obscure failure[0]. And,
rather interestingly, the WAL disk was able to "write" bogusly without
error for many minutes, which made for a fairly exotic recovery based
on pg_resetxlog: I grabbed quite a few minutes' worth of WAL segments
of various sub-16MB sizes to spot-check the situation.

It never occurred to me there was a way to really fix this that still
involves the archiver reading from a file system: what can one do when
one no longer trusts read() to get what was write()d?

[0]: Postgres wasn't very good about reporting the failure: in the
case where bogus files have been restored from archives, it seems to
just bounce through the timelines it knows about, searching for a WAL
segment it likes, without any real error message along the lines of
"got corrupt WAL from archive". That could probably be fixed.
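
For spot-checking an archive for the kind of short segments described above, something like the following could be used. It is only a minimal sketch: the function name find_short_segments and the archive directory layout are hypothetical, and it assumes the default 16MB segment size (the 16777216 in the error message). It only catches wrong lengths, not a full-length segment with bad contents, so it does not answer the read()/write() trust question.

```python
import os

# Default WAL segment size: the 16777216 from the error message above.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def find_short_segments(archive_dir, expected_size=WAL_SEGMENT_SIZE):
    """Return sorted (name, size) pairs for WAL segments of the wrong size.

    archive_dir is a hypothetical directory holding archived segments.
    """
    bad = []
    for name in sorted(os.listdir(archive_dir)):
        # WAL segment file names are 24 uppercase hex digits; skip
        # anything else (timeline .history files, .backup labels, ...).
        if len(name) != 24 or any(c not in "0123456789ABCDEF" for c in name):
            continue
        size = os.path.getsize(os.path.join(archive_dir, name))
        if size != expected_size:
            bad.append((name, size))
    return bad
```

Running this over the archive before starting recovery would at least name the offending segment and its actual size up front.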

--
fdr
