Re: Funny WAL corruption issue

From: Chris Travers <chris(dot)travers(at)gmail(dot)com>
To: Vladimir Rusinov <vrusinov(at)google(dot)com>
Cc: Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>, Vladimir Borodin <root(at)simply(dot)name>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Funny WAL corruption issue
Date: 2017-08-10 14:26:52
Message-ID: CAKt_Zfvj=0cXBqEW2UBjtcY7Y2munm1Z7dPqxTh4PSCA76cB-g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 10, 2017 at 3:17 PM, Vladimir Rusinov <vrusinov(at)google(dot)com>
wrote:

>
>
> On Thu, Aug 10, 2017 at 1:48 PM, Aleksander Alekseev <
> a(dot)alekseev(at)postgrespro(dot)ru> wrote:
>
>> I just wanted to point out that a hardware issue or third party software
>> issues (bugs in FS, software RAID, ...) could not be fully excluded from
>> the list of suspects. According to the talk by Christophe Pettus [1]
>> it's not that uncommon as most people think.
>
>
> This still might be the case of hardware corruption, but it does not look
> like one.
>

Yeah, I don't think so either. The systems were not restarted, only the
service, so I don't think this is a lie-on-write case. We have ECC with
full checks, etc. It really looks like something I initiated caused it,
but I'm not sure what, and I'm really not interested in trying to
reproduce it on a db of this size.

>
> Likelihood of two different persons seeing similar error message just a
> year apart is low. From our practice hardware corruption usually looks like
> a random single bit flip (most common - bad cpu or memory), bunch of zeroes
> (bad storage), or bunch of complete garbage (usually indicates in-memory
> pointer corruption).
>
> Chris, if you still have original WAL segment from the master and it's
> corrupt copy from standby, can you do bit-by-bit comparison to see how they
> are different? Also, if you can please share some hardware details.
> Specifically, do you use ECC? If so, are there any ECC errors logged? Do
> you use physical disks/ssd or some form of storage virtualization?
>

Straight on bare metal, ECC with no errors logged. SSD for both data and
WAL.

The bitwise comparison is interesting. Remember the error was:

pg_xlogdump: FATAL: error in WAL record at 1E39C/E1117FB8: unexpected
pageaddr 1E375/61118000 in log segment 000000000001E39C000000E1, offset
1146880
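
(In case anyone wants to repeat this kind of comparison: a byte-by-byte
diff of the two copies boils down to something like the minimal sketch
below. The file names are just placeholders for the master copy and the
standby's corrupt copy of the segment, not the actual paths here.)

#!/usr/bin/env python3
# Minimal sketch: byte-by-byte diff of two copies of the same WAL segment.
# File names below are placeholders, not the real paths.
good_path = "000000000001E39C000000E1.master"
bad_path = "000000000001E39C000000E1.standby"

with open(good_path, "rb") as f:
    good = f.read()
with open(bad_path, "rb") as f:
    bad = f.read()

assert len(good) == len(bad), "both copies should be a full wal_segment_size"

# Find the first offset where they diverge and count differing bytes.
first_diff = next((i for i, (a, b) in enumerate(zip(good, bad)) if a != b), None)
diff_count = sum(a != b for a, b in zip(good, bad))

if first_diff is None:
    print("copies are identical")
else:
    print(f"first divergence at offset {first_diff:#x}; {diff_count} bytes differ")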

Starting with the good segment:

Good WAL segment; I think the record starts at 003b:

0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 008c 0000 0000 0000
0118000 d093 0005 0001 0000 8000 e111 e39c 0001
0118010 0084 0000 0000 0000 7fb8 e111 e39c 0001
0118020 0910 0000 ccac 2eba 2000 0056 067f 0000
0118030 4000 0000 2249 0195 b5c4 0000 08ff 0001
0118040 0002 0003 0004 0005 0006 0007 0008 0009
0118050 000a 000b 000c 000d 000e 000f 0010 0011
0118060 0012 0013 0014 0015 0016 0017 0018 0019
0118070 001a 001b 001c 001d 001e 001f 0020 0021

And the corrupt copy from the standby:

0117fb0 0000 0000 0000 0000 003b 0000 0000 0000
0117fc0 7f28 e111 e39c 0001 0940 0000 cb88 db01
0117fd0 0200 0000 067f 0000 4000 0000 2249 0195
0117fe0 0001 0000 8001 0000 b5c3 0000 05ff 0000
0117ff0 0000 0003 0000 0000 4079 ce05 1cce ecf9
0118000 d093 0005 0001 0000 8000 6111 e375 0001
0118010 119d 0000 0000 0000 cfd4 00cc ca00 0410
0118020 1800 7c00 5923 544b dc20 914c 7a5c afec
0118030 db45 0060 b700 1910 1800 7c00 791f 2ede
0118040 c573 a110 5a88 e1e6 ab48 0034 9c00 2210
0118050 1800 7c00 4415 400d 2c7e b5e3 7c88 bcef
0118060 4666 00db 9900 0a10 1800 7c00 7d1d b355
0118070 d432 8365 de99 4dba 87c7 00ed 6200 2210

I think the divergence is interesting here. Up through 0117ff8 they are
identical. Then the last half of the line differs.
The first half of the next line is the same (but up through 011800a this
time), while the last 6 bytes differ (those six hold what appears to be
the unexpected pageaddr from the error), and we only have a few bits
different in the rest of the line.

It looks like some data and some flags were overwritten, perhaps while the
process exited. Very interesting.
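
To make that mapping concrete, here is a minimal sketch that decodes the
16 bytes at 0x118000 from the two dumps above, assuming the usual
XLogPageHeaderData layout for this era (uint16 magic, uint16 info, uint32
timeline, uint64 pageaddr). The byte strings are just the hexdump words
above re-expanded into raw little-endian bytes.

import struct

# First 16 bytes at offset 0x118000 in each copy. hexdump shows 16-bit
# little-endian words, so "d093 0005 ..." corresponds to the raw bytes below.
good_hdr = bytes.fromhex("93d0050001000000008011e19ce30100")
bad_hdr  = bytes.fromhex("93d00500010000000080116175e30100")

def decode(hdr):
    # Assumed layout: xlp_magic, xlp_info, xlp_tli, xlp_pageaddr.
    return struct.unpack("<HHIQ", hdr)

for name, hdr in (("good", good_hdr), ("bad ", bad_hdr)):
    magic, info, tli, pageaddr = decode(hdr)
    print(f"{name}: magic={magic:#06x} info={info:#06x} tli={tli} "
          f"pageaddr={pageaddr >> 32:X}/{pageaddr & 0xFFFFFFFF:08X}")

# Prints pageaddr 1E39C/E1118000 for the good copy and 1E375/61118000 for
# the bad one -- exactly the value pg_xlogdump complained about.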

> Also, in absolute majority of cases corruption is caught by checksums. I
> am not familiar with WAL protocol - do we have enough checksums when
> writing it out and on the wire? I suspect there are much more things
> PostgreSQL can do to be more resilient, and at least detect corruptions
> earlier.
>

Since this didn't throw a checksum error (we have data checksums disabled,
but WAL records ISTR have a separate CRC check), would this perhaps
indicate that the checksum operated over incorrect data?
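
My reading (hedged; it has been a while since I looked at xlogreader.c) is
that the record CRC never comes into play here: the per-page header,
including xlp_pageaddr, is not covered by the record CRC, and the reader
validates the page header before it computes any record CRC, so a corrupt
page header surfaces as "unexpected pageaddr" rather than a CRC failure.
Roughly, as a sketch of that ordering (the function names are mine, not
PostgreSQL's):

WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def expected_pageaddr(logno, segno, page_offset):
    # Address a page should carry, derived from the segment file name
    # (e.g. ...0001E39C000000E1) and the offset within the segment.
    return (logno << 32) + segno * WAL_SEG_SIZE + page_offset

def check_page(page_bytes, logno, segno, page_offset):
    pageaddr = int.from_bytes(page_bytes[8:16], "little")
    if pageaddr != expected_pageaddr(logno, segno, page_offset):
        # This is the check that fired for us: the standby's page claims
        # 1E375/61118000 where 1E39C/E1118000 was expected.
        raise ValueError(f"unexpected pageaddr "
                         f"{pageaddr >> 32:X}/{pageaddr & 0xFFFFFFFF:08X}")
    # Only pages that get past the header checks ever have their records
    # reassembled and CRC-verified, so the record CRC never saw this page.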

>
> --
> Vladimir Rusinov
> PostgreSQL SRE, Google Ireland
>
> Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
> Registered in Dublin, Ireland
> Registration Number: 368047
>

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more
