From: Alex Malek <magicagent(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date: 2020-02-26 15:18:30
Message-ID: CAGH8cccpuYa6GE5-TCfkiGx_H-yF9XhS9p0snAWxctuu8PHE8Q@mail.gmail.com
Lists: pgsql-hackers
On Thu, Feb 20, 2020 at 12:01 PM Alex Malek <magicagent(at)gmail(dot)com> wrote:
> On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
>> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent(at)gmail(dot)com> wrote:
>> >
>> >
>> > Hello Postgres Hackers -
>> >
>> > We are having a recurring issue on 2 of our replicas where
>> replication stops due to this message:
>> > "incorrect resource manager data checksum in record at ..."
>> > This has been occurring on average once every 1 to 2 weeks during large
>> data imports (100s of GBs being written)
>> > on one of two replicas.
>> > Fixing the issue has been relatively straightforward: shut down the
>> replica, remove the bad wal file, restart the replica and
>> > the good wal file is retrieved from the master.
>> > We are doing streaming replication using replication slots.
>> > However twice now, the master had already removed the WAL file so the
>> file had to be retrieved from the WAL archive.
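The manual fix described above might look like the following dry-run sketch. The data directory path and WAL segment name are hypothetical examples, not taken from the report:

```shell
# Dry-run sketch of the manual fix; echoes the commands instead of running
# them. The PGDATA path and segment name below are hypothetical examples.
fix_bad_wal() {
  pgdata=$1
  walfile=$2
  echo "pg_ctl -D $pgdata stop -m fast"    # shut down the replica
  echo "rm $pgdata/pg_wal/$walfile"        # remove the bad WAL segment
  echo "pg_ctl -D $pgdata start"           # restart; good copy is re-streamed
}
fix_bad_wal /var/lib/pgsql/10/data 000000010000ABCD00000042
```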
>> >
>> > The WAL log directories on the master and the replicas are on ZFS file
>> systems.
>> > All servers are running RHEL 7.7 (Maipo)
>> > PostgreSQL 10.11
>> > ZFS v0.7.13-1
>> >
>> > The issue seems similar to
>> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>> and to https://github.com/timescale/timescaledb/issues/1443
>> >
>> > One quirk in our ZFS setup is that ZFS is not handling our RAID array,
>> so ZFS sees the array as a single device.
>> >
>> > Right before the issue started we did some upgrades and altered some
>> postgres configs and ZFS settings.
>> > We have been slowly rolling back changes but so far the issue
>> continues.
>> >
>> > Some interesting data points while debugging:
>> > We had lowered the ZFS recordsize from 128K to 32K and for that week
>> the issue started happening every other day.
>> > Using xxd and diff we compared "good" and "bad" wal files and the
>> differences were not random bad bytes.
>> >
>> > The bad file either had a block of zeros not present in the good file
>> at that position, or other data. Occasionally the bad data has contained
>> legible strings not in the good file at that position. At least one of
>> those exact strings has existed elsewhere in the files.
>> > However I am not sure if that is the case for all of them.
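The xxd-and-diff comparison described above can be illustrated on two tiny files. The contents and the single corrupted byte here are invented, not actual WAL data:

```shell
# Illustration of the xxd + diff technique on two small files. The file
# contents and the corrupted byte are invented examples, not real WAL data.
printf 'AAAABBBBCCCCDDDD' > good.wal
printf 'AAAABBXBCCCCDDDD' > bad.wal     # one byte differs from good.wal
xxd good.wal > good.hex                 # hex dump of each file
xxd bad.wal > bad.hex
diff good.hex bad.hex || true           # diff exits non-zero when files differ
```

Dumping to hex first makes the diff show the offset and surrounding bytes of each discrepancy, which is how one can tell the corruption is not random single bytes.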
>> >
>> > This made me think that maybe there was an issue w/ wal file recycling
>> and ZFS under heavy load, so we tried lowering
>> > min_wal_size in order to "discourage" wal file recycling, but my
>> understanding is that a low value discourages recycling without
>> > preventing it (unless wal_recycle is disabled, available in PostgreSQL 12).
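The recycling-related settings under discussion would look like this in postgresql.conf; the values are illustrative, not the reporter's actual configuration:

```ini
# postgresql.conf -- illustrative values, not the actual settings in use
min_wal_size = 80MB     # a low value discourages, but does not prevent, recycling
#wal_recycle = off      # PostgreSQL 12+ only: disables WAL recycling entirely,
                        # added largely for copy-on-write filesystems such as ZFS
```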
>> >
>>
>> We do print a message "recycled write-ahead log file .." in DEBUG2
>> mode. You either want to run the server with DEBUG2 or maybe change
>> the code to make it LOG and see if that is printed. If you do that,
>> you can verify if the corrupted WAL is the same as a recycled one.
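Raising the log level as suggested here can be done in postgresql.conf on the server being investigated; note DEBUG2 is verbose, so expect substantial log volume under heavy WAL traffic:

```ini
# postgresql.conf -- a sketch; DEBUG2 produces a large amount of output
log_min_messages = debug2   # surfaces "recycled write-ahead log file ..." messages
```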
>>
>
> Are you suggesting having the master, the replicas or all in debug mode?
> How much extra logging would this generate?
> A replica typically consumes over 1 TB of WAL files before a bad wal file
> is encountered.
>
>
>
>> > There is a third replica where this bug has not (yet?) surfaced.
>> > This leads me to guess the bad data does not originate on the master.
>> > This replica is older than the other replicas, slower CPUs, less RAM,
>> and the WAL disk array is spinning disks.
>> > The OS, version of Postgres, and version of ZFS are the same as the
>> other replicas.
>> > This replica is not using a replication slot.
>> > This replica does not serve users so load/contention is much lower than
>> the others.
>> > The other replicas often have 100% utilization of the disk array that
>> houses the (non-wal) data.
>> >
>> > Any insight into the source of this bug or how to address it?
>> >
>> > Since the master has a good copy of the WAL file, can the replica
>> re-request the file from the master? Or from the archive?
>> >
>>
>> I think we do check in the archive if we get the error during
>> streaming, but archive might also have the same data due to which this
>> problem happens. Have you checked whether the archived WAL file is
>> different from the bad WAL? See the
>
>
> Typically the master, the archive and the other replicas all have a good
> copy of the WAL file.
>
> relevant bits of code in
>> WaitForWALToBecomeAvailable especially the code near below comment:
>>
>> "Failure while streaming. Most likely, we got here because streaming
>> replication was terminated, or promotion was triggered. But we also
>> get here if we find an invalid record in the WAL streamed from master,
>> in which case something is seriously wrong. There's little chance that
>> the problem will just go away, but PANIC is not good for availability
>> either, especially in hot standby mode. So, we treat that the same as
>> disconnection, and retry from archive/pg_wal again. The WAL in the
>> archive should be identical to what was streamed, so it's unlikely
>> that it helps, but one can hope..."
>>
>>
> Thank you for this comment!
> This made me realize that on the replicas I had mentioned we had removed
> the restore_command.
> The replica we thought was not having the issue was actually also
> getting/producing bad WAL files but was eventually recovering by getting a
> good WAL file from the archive b/c it had the restore_command defined.
>
>
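For reference, the restore_command whose removal is mentioned above would be along these lines in recovery.conf on PostgreSQL 10; the archive location is a hypothetical example:

```ini
# recovery.conf (PostgreSQL 10) -- the archive path is a hypothetical example
restore_command = 'cp /mnt/walarchive/%f "%p"'   # %f = WAL file name, %p = target path
```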
So, ignoring what is causing the underlying issue, what would be involved in
adding the ability for the replica to re-request the WAL file from the
master first? It seems that would make replication more resilient and
address similar issues such as
https://www.postgresql.org/message-id/CAPv0rXGZtFr2u5o3g70OMoH+WQYhmwq1aGsmL+PQHMjFf71Dkw@mail.gmail.com
that do not involve ZFS at all.
Thanks.
Alex