Re: BUG #19109: catalog lookup with the wrong snapshot during logical decoding causes coredump

From: "Haiyang Li" <mohen(dot)lhy(at)alibaba-inc(dot)com>
To: "Michael Paquier" <michael(at)paquier(dot)xyz>
Cc: "Xuneng Zhou" <xunengzhou(at)gmail(dot)com>, "pgsql-bugs" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #19109: catalog lookup with the wrong snapshot during logical decoding causes coredump
Date: 2025-11-12 06:35:22
Message-ID: 645b19e1-98c4-42d1-a175-f7f1e7d8d17d.mohen.lhy@alibaba-inc.com
Lists: pgsql-bugs

On 2025-11-11 10:46, I (Haiyang Li <mohen(dot)lhy(at)alibaba-inc(dot)com>) wrote:
> On 2025-11-11 10:12, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
>> That's unfortunate. Having a reproducible test case that works with
>> upstream would speed up the analysis of the problem a lot.
>
>
> Indeed. I can reproduce the core dump on the original instance by using the
> pg_logical_slot_peek_changes function, but I can't reproduce it on other, freshly created
> instances. I will try to reproduce this issue again later.
I have not been able to reproduce the core dump yet, but I think the following script sets up a
similar scene. The difference is that we end up with a "could not map filenumber xxx" ERROR instead.
Script:
```
-- Please extend the checkpoint and background writer intervals to avoid additional RUNNING_XACT WAL
-- being generated during testing. I have done this by modifying the LOG_SNAPSHOT_INTERVAL_MS
-- and CheckPointTimeout GUC settings.
-- s1(session 1)
CREATE TABLE t1 (
id integer NOT NULL,
unique1 integer,
hundred numeric,
tenthous numeric)
WITH (toast_tuple_target='128');
ALTER TABLE ONLY t1 ALTER COLUMN hundred SET STORAGE EXTENDED;
ALTER TABLE ONLY t1 REPLICA IDENTITY FULL;
CREATE TABLE t2 (
id integer NOT NULL,
unique1 integer,
hundred numeric,
tenthous numeric)
WITH (toast_tuple_target='128');
ALTER TABLE ONLY t2 ALTER COLUMN hundred SET STORAGE EXTENDED;
ALTER TABLE ONLY t2 REPLICA IDENTITY FULL;
select pg_create_logical_replication_slot('test', 'test_decoding');
begin;
insert into t1 values (1,1,1,1);
select pg_log_standby_snapshot();
commit;
begin;
insert into t2 values (1,1,1,1);
-- s2
ALTER TABLE IF EXISTS t1 ALTER COLUMN unique1 SET DATA TYPE NUMERIC;
-- s1
select pg_log_standby_snapshot();
commit;
select pg_logical_slot_get_changes('test', NULL, NULL);
begin;
insert into t1 values (1,1,1,1);
select pg_log_standby_snapshot();
commit;
\q
-- In a shell: remove the saved serialized snapshots
rm -rf {data_directory}/pg_logical/snapshots/*.snap
-- s1
-- this now fails with the "could not map filenumber" ERROR
select pg_logical_slot_peek_changes('test', NULL, NULL);
```
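For reference, the checkpointer part of that setup can also be done without a rebuild (a sketch, assuming a superuser session; LOG_SNAPSHOT_INTERVAL_MS is a compile-time constant in src/backend/storage/ipc/standby.c, so lengthening the background writer's snapshot interval does require recompiling):

```
-- lengthen the checkpoint interval so checkpoints don't log extra
-- RUNNING_XACTS records in the middle of the test
ALTER SYSTEM SET checkpoint_timeout = '1h';
SELECT pg_reload_conf();
```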
>
> While debugging the original instance using gdb, I found that txn 10722 was skipped due to
> SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT, causing the catalog snapshot
> to miss its commit. This suggests that the root cause may be in the new transaction handling
> logic during snapshot building.
>
The key point is that saved, still-useful snapshot files were removed (this does not happen during
normal operation, but we did it in our testing). As a result, logical decoding ran with a
confirmed_flush_lsn lower than the LSN at which builder->state reached the
SNAPBUILD_CONSISTENT state. Decoding then generates an incorrect catalog snapshot,
which can lead to a variety of unintended consequences, including core dumps, errors, data loss,
or, in some cases, no visible effect at all.
I agree that removing valid snapshot files is not a sensible thing to do. However, if this condition
could be detected and reported with a clear, consistent error message, rather than allowing
undefined or unexpected behavior, it would be a much better and more user-friendly experience.
----------
Regards,
Haiyang Li
