| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Alex Malek <magicagent(at)gmail(dot)com> | 
| Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org | 
| Subject: | Re: bad wal on replica / incorrect resource manager data checksum in record / zfs | 
| Date: | 2020-04-02 18:10:31 | 
| Message-ID: | 20200402181031.cvcola6xdegqrmmc@alap3.anarazel.de | 
| Lists: | pgsql-hackers | 
Hi,
On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
> We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."
Could you show the *exact* log output please? Because this could
temporarily occur without signalling anything bad, if e.g. the
replication connection goes down.
> Right before the issue started we did some upgrades and altered some
> postgres configs and ZFS settings.
> We have been slowly rolling back changes but so far the issue continues.
> 
> Some interesting data points while debugging:
> We had lowered the ZFS recordsize from 128K to 32K and for that week the
> issue started happening every other day.
> Using xxd and diff we compared "good" and "bad" wal files and the
> differences were not random bad bytes.
> 
> The bad file either had a block of zeros that were not in the good file at
> that position or other data.  Occasionally the bad data has contained
> legible strings not in the good file at that position. At least one of
> those exact strings has existed elsewhere in the files.
> However I am not sure if that is the case for all of them.
> 
> This made me think that maybe there was an issue w/ wal file recycling and
> ZFS under heavy load, so we tried lowering
> min_wal_size in order to "discourage" wal file recycling but my
> understanding is a low value discourages recycling but it will still
> happen (unless setting wal_recycle in psql 12).
This sounds a lot more like a broken filesystem than anything on the PG
level.
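For reference, the xxd/diff comparison described in the quoted text could be scripted roughly as below. This is only a sketch, not something from the original report: the function names are made up for illustration, and the 8192-byte page size assumes the default WAL block size. It lists the byte ranges where a "bad" segment differs from a "good" copy and notes whether the bad side is all zeros there.

```python
import sys
from typing import List, Tuple

# Assumption: default WAL block size (XLOG_BLCKSZ) of 8192 bytes.
WAL_PAGE_SIZE = 8192


def diff_ranges(good: bytes, bad: bytes) -> List[Tuple[int, int]]:
    """Return half-open [start, end) byte ranges where the buffers differ.

    Only the common prefix length is compared; WAL segments of the same
    cluster are expected to have identical sizes.
    """
    n = min(len(good), len(bad))
    ranges = []
    start = None
    for i in range(n):
        if good[i] != bad[i]:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, n))
    return ranges


def classify(bad: bytes, start: int, end: int) -> str:
    """Label a differing range of the bad file as 'zeros' or 'other data'."""
    chunk = bad[start:end]
    return "zeros" if chunk.count(0) == len(chunk) else "other data"


def report(good_path: str, bad_path: str) -> None:
    """Print every differing range with its WAL page number and a label."""
    with open(good_path, "rb") as f:
        good = f.read()
    with open(bad_path, "rb") as f:
        bad = f.read()
    for start, end in diff_ranges(good, bad):
        page = start // WAL_PAGE_SIZE
        label = classify(bad, start, end)
        print(f"offset {start:#x}-{end:#x} (page {page}, "
              f"{end - start} bytes): {label}")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python waldiff.py <good segment> <bad segment>
    report(sys.argv[1], sys.argv[2])
```

If the differing ranges consistently start and end on multiples of the ZFS recordsize rather than on WAL page boundaries, that would point further towards the filesystem.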
> When using replication slots, what circumstances would cause the master to
> not save the WAL file?
What do you mean by "save the WAL file"?
Greetings,
Andres Freund