Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Alex Malek <amalek(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: bad wal on replica / incorrect resource manager data checksum in record / zfs
Date: 2020-02-21 03:22:39
Message-ID: CAA4eK1JQXRKonq5dL2XS7ZZmRTNkik91WH1VYhBXTjfxaprHMg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 20, 2020 at 7:40 PM Alex Malek <amalek(at)gmail(dot)com> wrote:
>
> On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent(at)gmail(dot)com> wrote:
>> >
>> > Some interesting data points while debugging:
>> > We had lowered the ZFS recordsize from 128K to 32K and for that week the issue started happening every other day.
>> > Using xxd and diff we compared "good" and "bad" WAL files, and the differences were not random bad bytes.
>> >
>> > The bad file either had a block of zeros that were not in the good file at that position or other data. Occasionally the bad data has contained legible strings not in the good file at that position. At least one of those exact strings has existed elsewhere in the files.
>> > However I am not sure if that is the case for all of them.
>> >
>> > This made me think that maybe there was an issue with WAL file recycling and ZFS under heavy load, so we tried lowering
>> > min_wal_size in order to "discourage" WAL file recycling. But my understanding is that a low value only discourages recycling; it will still
>> > happen (unless wal_recycle is set to off, which is available in PostgreSQL 12).
>> >
>>
>> We do print a message "recycled write-ahead log file .." at DEBUG2
>> level. You either want to run the server with DEBUG2 or change
>> the code to make it LOG and see if that message is printed. If you do
>> that, you can verify whether the corrupted WAL file is the same as a
>> recycled one.
>
>
> Are you suggesting having the master, the replicas or all in debug mode?
>

The system(s) where you suspect that WAL recycling has created the
problem.

> How much extra logging would this generate?
>

To some extent, it depends on your workload. It will certainly
generate much more output than when the debug level is not enabled.
But what other option do you have to identify the root cause, or at
least to find out whether your suspicion is right? As mentioned
earlier, if you have the flexibility to change the code, you can
change the level (at the place I mentioned yesterday) to LOG, in
which case you can set log_min_messages to LOG and the server will
generate far fewer logs.
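To make the logging option concrete, the relevant setting can be sketched as follows (a minimal sketch: the GUC name is real, but the choice between the two values depends on whether the source is modified as described above):

```
# postgresql.conf -- sketch for catching WAL recycling in the log.
# Either raise the global verbosity (noisy on a busy server):
log_min_messages = debug2
# ...or, if the ereport() level of the "recycled write-ahead log
# file" message is changed to LOG in the source, this suffices:
# log_min_messages = log
```

After reloading the configuration, one can grep the server log for "recycled write-ahead log file" and compare the reported segment names against the segment that later turns out to be corrupted.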

> A replica typically consumes over 1 TB of WAL before a bad WAL file is encountered.
>
>
>> >
>> > Any insight into the source of this bug or how to address it?
>> >
>> > Since the master has a good copy of the WAL file, can the replica re-request the file from the master? Or from the archive?
>> >
>>
>> I think we do check the archive if we get the error during
>> streaming, but the archive might also have the same data, in which
>> case this problem happens. Have you checked whether the archive WAL
>> file is different from the bad one? See the
>
>
> Typically the master, the archive and the other replicas all have a good copy of the WAL file.
>
>> relevant bits of code in
>> WaitForWALToBecomeAvailable especially the code near below comment:
>>
>> "Failure while streaming. Most likely, we got here because streaming
>> replication was terminated, or promotion was triggered. But we also
>> get here if we find an invalid record in the WAL streamed from master,
>> in which case something is seriously wrong. There's little chance that
>> the problem will just go away, but PANIC is not good for availability
>> either, especially in hot standby mode. So, we treat that the same as
>> disconnection, and retry from archive/pg_wal again. The WAL in the
>> archive should be identical to what was streamed, so it's unlikely
>> that it helps, but one can hope..."
>>
>
> Thank you for this comment!
> This made me realize that on the replicas I had mentioned, we had removed the restore_command.
> The replica we thought was not having the issue was actually also receiving bad WAL files, but was eventually recovering by fetching a good copy from the archive because it had restore_command defined.
>

Good to know that there is a way to recover from the situation.
But I think it is better to find the root cause of what is producing
the bad WAL files so that you can fix it if possible.
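For completeness, a minimal restore_command setup on a standby might look like the sketch below. The archive path /mnt/wal_archive is hypothetical; substitute wherever archive_command on the primary writes to:

```
# postgresql.conf on the standby (PostgreSQL 12+; in older versions
# this line goes in recovery.conf instead). %f is expanded by the
# server to the requested WAL file name, %p to the destination path.
restore_command = 'cp /mnt/wal_archive/%f "%p"'
```

With this in place, a standby that finds an invalid record in streamed WAL retries from the archive rather than getting stuck, as the WaitForWALToBecomeAvailable comment quoted above describes.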

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
