| From: | Ishan joshi <ishanjoshi(at)live(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: Replication to standby broke with WAL file corruption |
| Date: | 2026-03-16 06:04:51 |
| Message-ID: | LV8PR84MB37866B6A6920CABB67F4E394A940A@LV8PR84MB3786.NAMPRD84.PROD.OUTLOOK.COM |
| Lists: | pgsql-general |
Thanks, Tomas, for the reply.
1663/33195/410203483 is a table that a user created inside a transaction. However, the transaction failed and was rolled back, so the table was dropped on the primary, and the primary was not impacted. The WAL, however, appears to be corrupt at this point: for the transaction that performed create table -> DML -> rollback, the DML is logged first, and it is then applied on the standby and DR site, where the table has not been created. It looks like a race condition while writing the WAL file.
This is a common scenario: if a transaction fails, it should be rolled back, and the transaction's operations should be logged in the WAL in sequence. In this case, the DML operation comes before the table creation in the WAL, which broke replication.
Thanks & Regards,
Ishan Joshi
________________________________
From: Tomas Vondra <tomas(at)vondra(dot)me>
Sent: 16 March 2026 04:39
To: Ishan joshi <ishanjoshi(at)live(dot)com>; pgsql-general(at)lists(dot)postgresql(dot)org <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Replication to standby broke with WAL file corruption
On 3/13/26 11:41, Ishan joshi wrote:
> Hi Team,
>
> I found an issue with PG v16.9 patroni setup where our standby node
> replication and disaster replication site replication broken with below
> error. It looks like WAL corruption which later part of archive file.
>
>
> CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0: rel
> 1663/33195/410203483, blk 25329"
> PANIC: WAL contains references to invalid pages"
> CONTEXT: WAL redo at 184F3/F248B6F0 for Heap/LOCK: xmax: 2818115117,
> off:35, infobits: [LOCK_ONLY, EXCL_LOCK], flags: 0x00; blkref #0:
> rel1663/33195/410203483, blk 25329"
> WARNING: page 25329 of relation base/33195/410203483 does not exist"
> INFO: no action. I am (pg-patroni-node1-0), a secondary, and following a
> leader (pg-patroni-node2-0)"
> [61]LOG: terminating any other active server processes"
> [61]LOG: startup process (PID 72) was terminated by signal 6: Aborted"
> [61]LOG: shutting down due to startup process failure"
> [61]LOG: database system is shut down"
> INFO: establishing a new patroni heartbeat connection to postgres"
> INFO: Lock owner: pg-patroni-node2-0; I am pg-patroni-node1-0"
> WARNING: Retry got exception: connection problems"
> WARNING: Failed to determine PostgreSQL state from the connection,
> fallingback to cached role"
> INFO: Error communicating with PostgreSQL. Will try again later"
> WARNING: Postgresql is not running."
>
>
> Primary db was not impacted, however standby node and DR site
> replication broken, I tried to reinit with latest backup + archive
> loading from pgbackrest backup but it fails with same error once the
> corrupt wal/archive file applying the changes. I had to reinit with
> pgbasebackup with 40TB database which took about 45 hrs of time.
>
> As I understand the transcation create table ->performed DML and then
> drop the table or transaction could be rollback that makes RACE
> condition in WAL file creation and got failed while applying the same in
> standby/DR site.
>
It's hard to say what caused this, but it might be interesting to look
at the WAL using pg_waldump. First at the WAL segment containing the
record triggering the failure, and then also at WAL segments before that
containing references to relation 1663/33195/410203483 (and especially
page 25329).
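As a rough illustration of that search (not something from the original message), the text output of pg_waldump could be filtered for block references to the relation with a small script. This is only a sketch: the function name `records_for_relation` is hypothetical, and the pattern assumes block references look like the `rel 1663/33195/410203483, blk 25329` fragments in the log above, with the comma and fork name treated as optional because the exact layout may differ between versions.

```python
import re

# Matches block references such as "rel 1663/33195/410203483, blk 25329".
# The comma and a "fork <name>" part are optional (assumed layout variations).
BLKREF = re.compile(r"rel\s*(\d+)/(\d+)/(\d+),?\s+(?:fork\s+\w+,?\s+)?blk\s+(\d+)")


def records_for_relation(lines, relation, block=None):
    """Return lines that reference `relation` ("tblspc/db/relfilenode").

    If `block` is given, keep only lines that reference that block number.
    """
    wanted = tuple(int(part) for part in relation.split("/"))
    hits = []
    for line in lines:
        for match in BLKREF.finditer(line):
            rel = tuple(int(x) for x in match.groups()[:3])
            if rel == wanted and (block is None or int(match.group(4)) == block):
                hits.append(line.rstrip("\n"))
                break  # one hit per line is enough
    return hits
```

Called with the lines of pg_waldump output saved to a file (for example `records_for_relation(open(path), "1663/33195/410203483", block=25329)`, where `path` is whatever file the output was written to), it returns the matching records in WAL order, so the first record that touches the relation can be compared with the one that triggered the PANIC.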
It is interesting this succeeded on a primary, but failed on standby.
Is there anything special about the relation 1663/33195/410203483? Do
you know if it's a regular / temporary table, etc?
regards
--
Tomas Vondra