Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

From: Alex Malek <magicagent(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date: 2020-02-26 15:18:30
Message-ID: CAGH8cccpuYa6GE5-TCfkiGx_H-yF9XhS9p0snAWxctuu8PHE8Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 20, 2020 at 12:01 PM Alex Malek <magicagent(at)gmail(dot)com> wrote:

> On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent(at)gmail(dot)com> wrote:
>> >
>> >
>> > Hello Postgres Hackers -
>> >
>> > We are having a recurring issue on two of our replicas where
>> replication stops due to this message:
>> > "incorrect resource manager data checksum in record at ..."
>> > This has been occurring on average once every 1 to 2 weeks during large
>> data imports (100s of GBs being written)
>> > on one of two replicas.
>> > Fixing the issue has been relatively straightforward: shut down the
>> replica, remove the bad WAL file, and restart the replica; the good WAL
>> > file is then retrieved from the master.
>> > We are doing streaming replication using replication slots.
>> > However, twice now the master had already removed the WAL file, so the
>> file had to be retrieved from the WAL archive.
>> >
>> > The WAL log directories on the master and the replicas are on ZFS file
>> systems.
>> > All servers are running RHEL 7.7 (Maipo)
>> > PostgreSQL 10.11
>> > ZFS v0.7.13-1
>> >
>> > The issue seems similar to
>> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>> and to https://github.com/timescale/timescaledb/issues/1443
>> >
>> > One quirk in our ZFS setup is that ZFS is not managing our RAID array,
>> so ZFS sees the array as a single device.
>> >
>> > Right before the issue started, we did some upgrades and altered some
>> Postgres configs and ZFS settings.
>> > We have been slowly rolling back changes, but so far the issue
>> continues.
>> >
>> > Some interesting data points while debugging:
>> > We had lowered the ZFS recordsize from 128K to 32K, and for that week
>> the issue happened every other day.
>> > Using xxd and diff we compared "good" and "bad" WAL files, and the
>> differences were not random bad bytes.
>> >
>> > The bad file either had a block of zeros that was not in the good file
>> at that position, or it had other data. Occasionally the bad data
>> contained legible strings that were not in the good file at that
>> position. At least one of those exact strings exists elsewhere in the
>> > files, though I am not sure whether that is the case for all of them.
>> >
>> > This made me think there might be an issue with WAL file recycling and
>> ZFS under heavy load, so we tried lowering
>> > min_wal_size in order to "discourage" WAL file recycling. My
>> understanding, though, is that a low value only discourages recycling; it
>> will still happen (unless wal_recycle is turned off, which requires PostgreSQL 12).
>> >
>>
>> We do print a message "recycled write-ahead log file .." in DEBUG2
>> mode. You either want to run the server with DEBUG2 or maybe change
>> the code to make it LOG and see if that is printed. If you do that,
>> you can verify if the corrupted WAL is the same as a recycled one.
>>
>
> Are you suggesting running the master, the replicas, or all of them in
> debug mode? How much extra logging would that generate?
> A replica typically consumes over 1 TB of WAL files before a bad WAL file
> is encountered.
>
>
>
>> > There is a third replica where this bug has not (yet?) surfaced.
>> > This leads me to guess the bad data does not originate on the master.
>> > This replica is older than the other replicas, with slower CPUs, less
>> RAM, and a WAL disk array of spinning disks.
>> > The OS, version of Postgres, and version of ZFS are the same as on the
>> other replicas.
>> > This replica is not using a replication slot.
>> > This replica does not serve users, so load/contention is much lower
>> than on the others.
>> > The other replicas often have 100% utilization of the disk array that
>> houses the (non-WAL) data.
>> >
>> > Any insight into the source of this bug or how to address it?
>> >
>> > Since the master has a good copy of the WAL file, can the replica
>> re-request the file from the master? Or from the archive?
>> >
>>
>> I think we do check in the archive if we get the error during streaming,
>> but the archive might also have the same data, which is why this problem
>> can happen. Have you checked whether the archived WAL file is different
>> from the bad WAL? See the
>
>
> Typically the master, the archive and the other replicas all have a good
> copy of the WAL file.
>
>> relevant bits of code in WaitForWALToBecomeAvailable, especially the
>> code near the comment below:
>>
>> "Failure while streaming. Most likely, we got here because streaming
>> replication was terminated, or promotion was triggered. But we also
>> get here if we find an invalid record in the WAL streamed from master,
>> in which case something is seriously wrong. There's little chance that
>> the problem will just go away, but PANIC is not good for availability
>> either, especially in hot standby mode. So, we treat that the same as
>> disconnection, and retry from archive/pg_wal again. The WAL in the
>> archive should be identical to what was streamed, so it's unlikely
>> that it helps, but one can hope..."
>>
>>
> Thank you for this comment!
> This made me realize that on the replicas I had mentioned, we had removed
> the restore_command.
> The replica we thought was not having the issue was actually also
> getting/producing bad WAL files, but it was eventually recovering by
> fetching a good WAL file from the archive because it had the
> restore_command defined.
>
>
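
In the meantime, the mitigation on our side has been to re-add a
restore_command on every replica, so that when a streamed segment arrives
corrupted the standby can eventually replace it with a good copy from the
WAL archive. On PostgreSQL 10 that is a single line in recovery.conf, along
the lines of restore_command = 'cp /path/to/wal_archive/%f %p' (the path is
only a placeholder for wherever the archive lives).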

So, setting aside what is causing the underlying corruption, what would be
involved in adding the ability for the replica to re-request the WAL file
from the master first? That would make replication more resilient, and it
would also address similar reports that do not involve ZFS at all, such as
https://www.postgresql.org/message-id/CAPv0rXGZtFr2u5o3g70OMoH+WQYhmwq1aGsmL+PQHMjFf71Dkw@mail.gmail.com
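
To make the question more concrete, below is a rough sketch of the fallback
order I have in mind. This is purely illustrative pseudologic, not actual
PostgreSQL code (the real behavior lives in WaitForWALToBecomeAvailable),
and every name in it is invented for the example:

/*
 * Illustrative only -- NOT PostgreSQL source; all names are invented.
 * Idea: after a bad record from streaming, re-request the segment from
 * the primary before falling back to the archive / local pg_wal.
 */
typedef enum
{
    SRC_STREAM,    /* walreceiver streaming from the primary */
    SRC_RESTREAM,  /* proposed: ask the primary to resend the segment */
    SRC_ARCHIVE,   /* restore_command / WAL archive */
    SRC_PG_WAL     /* whatever already sits in pg_wal */
} WalSource;

static WalSource
next_source_after_bad_record(WalSource failed_source)
{
    switch (failed_source)
    {
        case SRC_STREAM:
            /* A streamed record failed its checksum.  Today the code goes
             * to the archive; the proposal is to first re-request the
             * segment from the primary, whose copy was good in our case. */
            return SRC_RESTREAM;
        case SRC_RESTREAM:
            /* The re-streamed copy is also bad: fall back to the archive. */
            return SRC_ARCHIVE;
        case SRC_ARCHIVE:
            /* Archive copy bad or missing: try local pg_wal, then loop. */
            return SRC_PG_WAL;
        default:
            return SRC_STREAM;
    }
}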

Thanks.
Alex
