Re: Funny WAL corruption issue

From: Chris Travers <chris(dot)travers(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Vladimir Rusinov <vrusinov(at)google(dot)com>, Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>, Vladimir Borodin <root(at)simply(dot)name>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Funny WAL corruption issue
Date: 2017-08-11 13:45:10
Message-ID: CAKt_ZfvoqxR25Zv7mWfxkaju2GSTEkJFNZXCdp_mwJ+6aZtx0w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 11, 2017 at 1:33 PM, Greg Stark <stark(at)mit(dot)edu> wrote:

> On 10 August 2017 at 15:26, Chris Travers <chris(dot)travers(at)gmail(dot)com> wrote:
> >
> >
> > The bitwise comparison is interesting. Remember the error was:
> >
> > pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected
> > pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
> > 1146880
> ...
> > Since this didn't throw a checksum error (we have data checksums
> > disabled but wal records ISTR have a separate CRC check), would this
> > perhaps indicate that the checksum operated over incorrect data?
>
> No checksum error and this "unexpected pageaddr" doesn't necessarily
> mean data corruption. It could mean that when the database stopped logging
> it was reusing a wal file and the old wal stream had a record boundary
> on the same byte position. So the previous record checksum passed and
> the following record checksum passes but the record header is for a
> different wal stream position.
>
> I think you could actually hack xlogdump to ignore this condition and
> keep outputting and you'll see whether the records that follow appear
> to be old wal log data. I haven't actually tried this though.
>

For better or worse, I get a different error at the same spot if I try this
(which involved disabling the check in the backend WAL reader):

pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: invalid
contrecord length 4509 at 1E39C/E1117FF8

If I hack it to ignore all errors on that record, no further records come
out, though it does run over the same records.

This leads me to conclude there are no further valid records.
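
The addresses in the first error can be checked against the segment-reuse
explanation with a little arithmetic. What follows is only a minimal Python
sketch, assuming the default 16 MB WAL segment size and the usual
24-hex-digit segment file name (timeline, log id, segment id, 8 digits
each). The helper names are made up for illustration and are not part of
PostgreSQL.

```python
# Assumption: default 16 MB WAL segments (256 segments per 4 GB log id).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Turn an 'X/Y' hex LSN string into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(lsn):
    """Format a 64-bit LSN the way the error messages print it."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)


def expected_pageaddr(segment_name, offset):
    """Address a page header should carry at 'offset' in this segment."""
    log_id = int(segment_name[8:16], 16)   # middle 8 hex digits
    seg_id = int(segment_name[16:24], 16)  # last 8 hex digits
    return (log_id << 32) + seg_id * WAL_SEGMENT_SIZE + offset


def segment_suffix_for(lsn):
    """Log id and segment id part of the file name that holds this LSN."""
    seg_id = (lsn & 0xFFFFFFFF) // WAL_SEGMENT_SIZE
    return "%08X%08X" % (lsn >> 32, seg_id)


def offset_in_segment(lsn):
    """Byte offset of this LSN inside its segment file."""
    return lsn % WAL_SEGMENT_SIZE


# Values copied from the pg_xlogdump error message.
expected = expected_pageaddr("000000000001E39C000000E1", 1146880)
found = parse_lsn("1E375/61118000")

print("expected pageaddr:", format_lsn(expected))
print("found pageaddr:   ", format_lsn(found))
print("found belongs to: ", segment_suffix_for(found),
      "offset", offset_in_segment(found))
```

Under these assumptions the page at offset 1146880 of that segment should
carry pageaddr 1E39C/E1118000, while the header actually found,
1E375/61118000, sits at the same offset (1146880) of an older segment whose
name ends in 0001E37500000061. That is consistent with a recycled segment
still holding old pages, though it does not by itself rule out corruption.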

>
> --
> greg
>

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more
