Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

From: Alex Malek <magicagent(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date: 2020-04-02 17:44:57
Message-ID: CAGH8ccfa3fPoT0TizkrQ3Z4gz5XJi+pSBqN8CHUAHmqWEcf0zA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 19, 2020 at 4:35 PM Alex Malek <magicagent(at)gmail(dot)com> wrote:

>
> Hello Postgres Hackers -
>
> We are having a recurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."
> This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> on one of two replicas.
> Fixing the issue has been relatively straightforward: shut down the replica,
> remove the bad WAL file, restart the replica, and
> the good WAL file is retrieved from the master.
> We are doing streaming replication using replication slots.
> However, twice now the master had already removed the WAL file, so the file
> had to be retrieved from the WAL archive.
>
> The WAL log directories on the master and the replicas are on ZFS file
> systems.
> All servers are running RHEL 7.7 (Maipo)
> PostgreSQL 10.11
> ZFS v0.7.13-1
>
> The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> and to https://github.com/timescale/timescaledb/issues/1443
>
> One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> ....
> <snip>
>
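
For reference, the "bad wal file" in the procedure quoted above can be located
from the LSN printed in the error message. The following is only a sketch: it
assumes the default 16 MB WAL segment size and that the replica is on timeline 1,
and the function name and the sample LSNs are hypothetical, not values from our
servers.

```python
def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = 16 * 1024 * 1024) -> str:
    """Map an LSN such as '16/B374D848' to a 24-character WAL file name.

    Assumptions (not from the original report): default 16 MB segments and
    a known timeline ID. Adjust both for a differently configured cluster.
    """
    hi_text, lo_text = lsn.split("/")
    hi = int(hi_text, 16)          # high 32 bits of the LSN
    lo = int(lo_text, 16)          # low 32 bits of the LSN

    # Number of segments that fit in one 4 GB "xlog id" range.
    segments_per_xlogid = 0x100000000 // segment_size

    # Absolute segment number containing this byte position.
    seg_no = ((hi << 32) | lo) // segment_size

    return "%08X%08X%08X" % (
        timeline,
        seg_no // segments_per_xlogid,
        seg_no % segments_per_xlogid,
    )


# Hypothetical LSN, used only to show the call.
print(wal_segment_name("16/B374D848"))
```

On a running server, the SQL function pg_walfile_name() performs the same
mapping, but it cannot be called during recovery, which is why an offline
calculation can be handy on a replica.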

An update in case someone else encounters the same issue.

About 5 weeks ago, on the master database server, we turned off ZFS
compression for the volume where the WAL log resides.
The error has not occurred on any replica since.
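
For anyone wanting to try the same change, here is a rough sketch of the
commands involved, wrapped in Python so they can be reviewed before being run.
The dataset name "tank/pgwal" is a placeholder, not our actual layout, and the
helper name is made up for illustration. It only builds and prints the command
lines; it does not execute them.

```python
import shlex
from typing import List


def zfs_disable_compression_commands(dataset: str) -> List[List[str]]:
    """Return the zfs command lines to inspect and turn off compression.

    `dataset` is whatever ZFS dataset holds the WAL directory; the value
    used below is a placeholder.
    """
    return [
        # Show the current setting and achieved ratio before changing anything.
        ["zfs", "get", "compression,compressratio", dataset],
        # Turn compression off for this dataset.
        ["zfs", "set", "compression=off", dataset],
        # Confirm the new setting.
        ["zfs", "get", "compression", dataset],
    ]


commands = zfs_disable_compression_commands("tank/pgwal")  # placeholder name
for argv in commands:
    print(shlex.join(argv))
```

Note that changing the compression property only affects blocks written after
the change; existing files stay compressed until they are rewritten.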

Best,
Alex
