Re: Corruption during WAL replay

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, deniel1495(at)mail(dot)ru, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, tejeswarm(at)hotmail(dot)com, hlinnaka <hlinnaka(at)iki(dot)fi>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Daniel Wood <hexexpert(at)comcast(dot)net>
Subject: Re: Corruption during WAL replay
Date: 2022-03-25 05:23:00
Message-ID: 3192026.1648185780@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh. I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now? I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

2. Why just these two animals? If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations. OTOH, I see petalura and pogona just fell over too.
Do you have some of those animals --with-icu and others not?

> 16bit checksums for the win.

Yay :-(

As for a fix, would damaging more of the page help? I guess
it'd just move around the one-in-64K chance of failure.
Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.
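
A minimal sketch of the "invert the checksum field" idea, in Python for illustration only. It assumes the default 8192-byte block size and that the 16-bit pd_checksum field sits at byte offset 8 of the page header, right after the 8-byte pd_lsn. The helper name invert_checksum is hypothetical and is not taken from any existing test. Because x ^ 0xFFFF never equals x, the stored checksum is guaranteed to change. And since the rest of the page is untouched, the checksum recomputed over the page contents should stay the same, so the mismatch no longer depends on a one-in-65536 coincidence.

```python
import struct

BLCKSZ = 8192            # assumed: default PostgreSQL block size
PD_CHECKSUM_OFFSET = 8   # assumed: pd_checksum follows the 8-byte pd_lsn


def invert_checksum(page: bytes) -> bytes:
    """Return a copy of `page` with only the 16-bit checksum field inverted.

    Hypothetical helper for illustration. Inverting all 16 bits always
    yields a value different from the original, unlike overwriting other
    parts of the page and hoping the checksum no longer matches.
    """
    if len(page) != BLCKSZ:
        raise ValueError(f"expected a {BLCKSZ}-byte page, got {len(page)}")

    # Read the stored checksum in native byte order, as it is on disk.
    (stored,) = struct.unpack_from("=H", page, PD_CHECKSUM_OFFSET)

    damaged = bytearray(page)
    struct.pack_into("=H", damaged, PD_CHECKSUM_OFFSET, stored ^ 0xFFFF)
    return bytes(damaged)
```

Applying the helper twice restores the original page, which makes it easy for a test to undo the damage afterwards.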

regards, tom lane
