bad wal on replica / incorrect resource manager data checksum in record / zfs

From: Alex Malek <magicagent(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date: 2020-02-19 21:35:53
Message-ID: CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1=tifg@mail.gmail.com
Lists: pgsql-hackers

Hello Postgres Hackers -

We are having a recurring issue on 2 of our replicas where replication
stops due to this message:
"incorrect resource manager data checksum in record at ..."
This has been occurring on average once every 1 to 2 weeks during large
data imports (100s of GBs being written) on one of the two replicas.
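
For reference, one way to confirm which record is damaged is to run
pg_waldump against the suspect segment offline (the install path and
segment name below are hypothetical); it stops with an error at the
first bad record:

    /usr/pgsql-10/bin/pg_waldump \
        /var/lib/pgsql/10/data/pg_wal/000000010000004200000023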
Fixing the issue has been relatively straightforward: shut down the
replica, remove the bad WAL file, and restart the replica; the good WAL
file is then retrieved from the master.
We are doing streaming replication using replication slots.
However, twice now the master had already removed the WAL file, so the
file had to be retrieved from the WAL archive.
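
Roughly, the manual fix looks like this (service name, paths, and
segment name are hypothetical):

    systemctl stop postgresql-10
    mkdir -p /tmp/bad_wal
    mv /var/lib/pgsql/10/data/pg_wal/000000010000004200000023 /tmp/bad_wal/
    systemctl start postgresql-10
    # streaming replication then re-fetches the segment from the master;
    # when the master no longer has it, we copy it back from the archive:
    # cp /mnt/wal_archive/000000010000004200000023 /var/lib/pgsql/10/data/pg_wal/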

The WAL log directories on the master and the replicas are on ZFS file
systems.
All servers are running RHEL 7.7 (Maipo)
PostgreSQL 10.11
ZFS v0.7.13-1

The issue seems similar to
https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
and to https://github.com/timescale/timescaledb/issues/1443

One quirk in our ZFS setup is that ZFS is not handling our RAID array,
so ZFS sees our array as a single device.

Right before the issue started we did some upgrades and altered some
Postgres configs and ZFS settings.
We have been slowly rolling back changes, but so far the issue continues.

Some interesting data points while debugging:
We had lowered the ZFS recordsize from 128K to 32K, and for that week
the issue started happening every other day.
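
(The change in question, since reverted; the dataset name is
hypothetical:)

    zfs set recordsize=32K tank/pg_wal   # default is 128K
    zfs get recordsize tank/pg_wal       # verify the active value
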
Using xxd and diff we compared "good" and "bad" WAL files, and the
differences were not random bad bytes.
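
(Roughly like this; actual file names elided:)

    xxd /tmp/bad_wal/000000010000004200000023     > /tmp/bad.hex
    xxd /mnt/wal_archive/000000010000004200000023 > /tmp/good.hex
    diff /tmp/good.hex /tmp/bad.hex | head -n 50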

The bad file either had a block of zeros that was not in the good file
at that position, or other data. Occasionally the bad data has contained
legible strings not in the good file at that position. At least one of
those exact strings has existed elsewhere in the files, though I am not
sure whether that is the case for all of them.

This made me think that maybe there was an issue with WAL file recycling
and ZFS under heavy load, so we tried lowering min_wal_size in order to
"discourage" WAL file recycling. My understanding is that a low value
discourages recycling but recycling will still happen (unless
wal_recycle is turned off, which is only possible in PostgreSQL 12).
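
What we ran, approximately (the value is hypothetical; min_wal_size
takes effect on reload, no restart needed):

    psql -c "ALTER SYSTEM SET min_wal_size = '32MB';"
    psql -c "SELECT pg_reload_conf();"
    # on PostgreSQL 12+ recycling could be disabled outright:
    # psql -c "ALTER SYSTEM SET wal_recycle = off;"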

There is a third replica where this bug has not (yet?) surfaced.
This leads me to guess the bad data does not originate on the master.
This replica is older than the other replicas, with slower CPUs, less
RAM, and a WAL disk array of spinning disks.
The OS, version of Postgres, and version of ZFS are the same as the other
replicas.
This replica is not using a replication slot.
This replica does not serve users so load/contention is much lower than the
others.
The other replicas often have 100% utilization of the disk array that
houses the (non-wal) data.

Any insight into the source of this bug or how to address it?

Since the master has a good copy of the WAL file, can the replica
re-request the file from the master? Or from the archive?

When using replication slots, what circumstances would cause the master to
not save the WAL file?
(I can't remember if it always had the next wal file or the one after that)
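
(In case it is relevant, this is the check we use on the master to see
what the slots are holding back:)

    psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
    # the master should retain every segment at or after the oldest restart_lsn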

Thanks in advance,
Alex Malek
