Streaming replication fails after some time with 'incorrect resource manager data checksum'

From: Julian Backes <julianbackes(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Streaming replication fails after some time with 'incorrect resource manager data checksum'
Date: 2019-12-18 12:45:56
Message-ID: CAPv0rXGZtFr2u5o3g70OMoH+WQYhmwq1aGsmL+PQHMjFf71Dkw@mail.gmail.com
Lists: pgsql-general

Hello all!

I already posted in the #help channel of the Slack chat but got no answer :-(

We have a read-only / hot standby system and are facing the same problem as
described in
https://stackoverflow.com/questions/35752389/incorrect-resource-manager-data-checksum-in-record-at-2-xyz-terminating-walrec
(the post is already 3 years old).

That means that after some time (sometimes two days, sometimes half a day),
PostgreSQL starts logging 'incorrect resource manager data checksum in record
at xyz' and shuts down the WAL receiver, which stops streaming replication.

Master and slave are running Ubuntu 18.04 and PostgreSQL 12.1 on ext4 file
systems (no ZFS or Btrfs, just LVM on the master); we only use ECC memory
(192 GB on the master and 256 GB on the slave) and NVMe SSDs on both
servers in a software RAID 1.
When the error occurs, a restart of PostgreSQL on the slave "fixes" the
problem.
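
To help narrow down when and where it happens, here is a minimal sketch (in
Python) of how the affected LSNs could be collected from the slave's log. It
assumes the log is a plain text file with one message per line; the function
names are made up for illustration and are not part of PostgreSQL.

```python
import re

# Matches the message text quoted above. An LSN is written as two
# hexadecimal numbers separated by a slash, e.g. 2/1A000028.
CHECKSUM_ERROR = re.compile(
    r"incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def lsn_to_int(lsn):
    """Convert an LSN such as '2/1A000028' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def find_checksum_errors(log_lines):
    """Return (line_number, lsn) for every line reporting the checksum error.

    log_lines is any iterable of strings, for example an open log file.
    Line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        match = CHECKSUM_ERROR.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Calling find_checksum_errors() on the open log file gives the line numbers and
LSNs of each failure. The LSNs could then be compared between occurrences, or
the same WAL record could be inspected on the master with pg_waldump, to see
whether the record is already damaged there or only on the slave.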

Any ideas what we can do to prevent/investigate the problem?

Kind regards
Julian
