Fwd: bad wal on replica / incorrect resource manager data checksum in record / zfs

From: Alex Malek <magicagent(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Subject: Fwd: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date: 2020-02-20 17:01:45
Message-ID: CAGH8ccf1OvcewXMQV7b1+iWA-u2TZe+-z2nmxJMFXNqecfK-bg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent(at)gmail(dot)com> wrote:
> >
> >
> > Hello Postgres Hackers -
> >
> > We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> > "incorrect resource manager data checksum in record at ..."
> > This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> > on one of two replicas.
> > Fixing the issue has been relatively straight forward: shutdown replica,
> remove the bad wal file, restart replica and
> > the good wal file is retrieved from the master.
> > We are doing streaming replication using replication slots.
> > However twice now, the master had already removed the WAL file so the
> file had to retrieved from the wal archive.
> >
> > The WAL log directories on the master and the replicas are on ZFS file
> systems.
> > All servers are running RHEL 7.7 (Maipo)
> > PostgreSQL 10.11
> > ZFS v0.7.13-1
> >
> > The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
> and to https://github.com/timescale/timescaledb/issues/1443
> >
> > One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> >
> > Right before the issue started we did some upgrades and altered some
> postgres configs and ZFS settings.
> > We have been slowly rolling back changes but so far the the issue
> continues.
> >
> > Some interesting data points while debugging:
> > We had lowered the ZFS recordsize from 128K to 32K and for that week the
> issue started happening every other day.
> > Using xxd and diff we compared "good" and "bad" wal files and the
> differences were not random bad bytes.
> >
> > The bad file either had a block of zeros that were not in the good file
> at that position or other data. Occasionally the bad data has contained
> legible strings not in the good file at that position. At least one of
> those exact strings has existed elsewhere in the files.
> > However I am not sure if that is the case for all of them.
> >
> > This made me think that maybe there was an issue w/ wal file recycling
> and ZFS under heavy load, so we tried lowering
> > min_wal_size in order to "discourage" wal file recycling but my
> understanding is a low value discourages recycling but it will still
> > happen (unless setting wal_recycle in psql 12).
> >
>
> We do print a message "recycled write-ahead log file .." in DEBUG2
> mode. You either want to run the server with DEBUG2 or maybe change
> the code to make it LOG and see if that is printed. If you do that,
> you can verify if the corrupted WAL is the same as a recycled one.
>

Are you suggesting having the master, the replicas or all in debug mode?
How much extra logging would this generate?
A replica typically consumes over 1 TB of WAL files before a bad wal file
is encountered.

> > There is a third replica where this bug has not (yet?) surfaced.
> > This leads me to guess the bad data does not originate on the master.
> > This replica is older than the other replicas, slower CPUs, less RAM,
> and the WAL disk array is spinning disks.
> > The OS, version of Postgres, and version of ZFS are the same as the
> other replicas.
> > This replica is not using a replication slot.
> > This replica does not serve users so load/contention is much lower than
> the others.
> > The other replicas often have 100% utilization of the disk array that
> houses the (non-wal) data.
> >
> > Any insight into the source of this bug or how to address it?
> >
> > Since the master has a good copy of the WAL file, can the replica
> re-request the file from the master? Or from the archive?
> >
>
> I think we do check in the archive if we get the error during
> streaming, but archive might also have the same data due to which this
> problem happens. Have you checked that the archive WAL file, is it
> different from the bad WAL? See the

Typically the master, the archive and the other replicas all have a good
copy of the WAL file.

relevant bits of code in
> WaitForWALToBecomeAvailable especially the code near below comment:
>
> "Failure while streaming. Most likely, we got here because streaming
> replication was terminated, or promotion was triggered. But we also
> get here if we find an invalid record in the WAL streamed from master,
> in which case something is seriously wrong. There's little chance that
> the problem will just go away, but PANIC is not good for availability
> either, especially in hot standby mode. So, we treat that the same as
> disconnection, and retry from archive/pg_wal again. The WAL in the
> archive should be identical to what was streamed, so it's unlikely
> that it helps, but one can hope..."
>
>
Thank you for this comment!
This made me realize that on the replicas I had mentioned we had removed
the restore_command.
The replica we thought was not having the issue, was actually also
getting/producing bad WAL files but was eventually recovering by getting a
good WAL file from the archive b/c it had the restore_command defined.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-02-20 17:02:36 Re: error context for vacuum to include block number
Previous Message David Fetter 2020-02-20 16:43:26 Re: Parallel copy