Re: Replication to standby broke with WAL file corruption

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Ishan joshi <ishanjoshi(at)live(dot)com>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Replication to standby broke with WAL file corruption
Date: 2026-03-15 23:09:05
Message-ID: eb716c9a-983c-4ba3-9ceb-7d60d1825e4f@vondra.me
Lists: pgsql-general

On 3/13/26 11:41, Ishan joshi wrote:
> Hi Team,
>
> I found an issue with a PG v16.9 Patroni setup where replication to our
> standby node and to the disaster recovery site broke with the error
> below. It looks like WAL corruption, which is later part of the archive file.
>
>
> CONTEXT:  WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> PANIC:  WAL contains references to invalid pages"
> CONTEXT:  WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0:
> rel1663/33195/410203483, blk 25329"
> WARNING:  page 25329 of relation base/33195/410203483 does not exist"
> INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a
> leader (pg-patroni-node2-0)"
> [61]LOG:  terminating any other active server processes"
> [61]LOG:  startup process (PID 72) was terminated by signal 6: Aborted"
> [61]LOG:  shutting down due to startup process failure"
> [61]LOG:  database system is shut down"
> INFO: establishing a new patroni heartbeat connection to postgres"
> INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0"
> WARNING: Retry got exception: connection problems"
> WARNING: Failed to determine PostgreSQL state from the connection,
> fallingback to cached role"
> INFO: Error communicating with PostgreSQL. Will try again later"
> WARNING: Postgresql is not running."
>
>
> The primary DB was not impacted; however, standby node and DR site
> replication broke. I tried to reinit with the latest backup + archive
> loading from the pgbackrest backup, but it fails with the same error once
> the changes from the corrupt WAL/archive file are applied. I had to reinit
> with pg_basebackup on a 40TB database, which took about 45 hours.
>
> As I understand it, the transaction created a table, performed DML, and
> then dropped the table, or the transaction could have been rolled back;
> this caused a race condition in WAL file creation, and applying it then
> failed on the standby/DR site.
>

It's hard to say what caused this, but it would be worth looking at the
WAL using pg_waldump: first at the WAL segment containing the record
triggering the failure, and then at the WAL segments before that which
contain references to relation 1663/33195/410203483 (and especially
page 25329).
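
As a rough sketch of how to locate that segment: the snippet below maps
the LSN from the error (184F3/F248B6F0) to a WAL segment file name and
assembles a pg_waldump command line filtered to the relation and block
in question. The timeline ID (1), the 16MB segment size, and the WAL
directory path are assumptions for illustration only, not values taken
from the report; the --relation and --block filters are available in
pg_waldump from PostgreSQL 15 on.

```python
# Assumed default; check the cluster's actual wal_segment_size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(lsn: str, timeline: int,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the WAL segment file name that contains the given LSN."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    position = (high << 32) | low              # 64-bit WAL byte position
    segno = position // segment_size           # global segment number
    segs_per_xlogid = 0x1_0000_0000 // segment_size
    return (f"{timeline:08X}"
            f"{segno // segs_per_xlogid:08X}"
            f"{segno % segs_per_xlogid:08X}")


def waldump_args(wal_dir: str, segment: str, relation: str, block: int) -> list:
    """Build (but do not run) a pg_waldump command filtered to one block."""
    return ["pg_waldump", "--path", wal_dir,
            "--relation", relation, "--block", str(block), segment]


# timeline=1 and "/path/to/wal" are hypothetical placeholders.
segment = wal_segment_name("184F3/F248B6F0", timeline=1)
command = waldump_args("/path/to/wal", segment, "1663/33195/410203483", 25329)
print(segment)
print(" ".join(command))
```

Running the same filter over the preceding segments (for example from
the restored archive) should show earlier records touching that block,
such as the one that extended the relation.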

It is interesting that this succeeded on the primary but failed on the standby.

Is there anything special about the relation 1663/33195/410203483? Do
you know whether it's a regular table, a temporary table, etc.?

regards

--
Tomas Vondra
