Re: Fix logical decoding not track transaction during SNAPBUILD_BUILDING_SNAPSHOT

From: ocean_li_996 <ocean_li_996(at)163(dot)com>
To: "Ajin Cherian" <itsajin(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Fix logical decoding not track transaction during SNAPBUILD_BUILDING_SNAPSHOT
Date: 2026-01-29 19:27:03
Message-ID: 1d8ffe1.9688.19c0b39303e.Coremail.ocean_li_996@163.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Ajin,

At 2026-01-29 11:33:59, "Ajin Cherian" <itsajin(at)gmail(dot)com> wrote:
>I tested with patch 0002+0004 on HEAD, and the test added by patch
>0002 is failing like below.
>
>not ok 15 - snapshot_build 289 ms
># (test process exited with exit code 1)
>
>I see the postgres crashed and when I look at the core file, I see the
>below stack trace:
>Core was generated by
>`/home/ajin/postgresql/postgres3/postgres/tmp_install/home/ajin/install-oss/bin/postgres
>'' '' '' '' '' '' '''.
>Program terminated with signal SIGABRT, Aborted.
>#0 0x00007436092c7e9c in __pthread_kill_implementation () from /lib64/libc.so.6
>Missing rpms, try: dnf --enablerepo='*debug*' install
>zlib-ng-compat-debuginfo-2.2.3-2.el10.x86_64
>glibc-debuginfo-2.39-38.el10.x86_64
>libicu-debuginfo-74.2-5.el10.x86_64
>libstdc++-debuginfo-14.3.1-2.1.el10.x86_64
>libgcc-debuginfo-14.3.1-2.1.el10.x86_64
>(gdb) bt
>#0 0x00007436092c7e9c in __pthread_kill_implementation () from /lib64/libc.so.6
>#1 0x0000743609271a96 in raise () from /lib64/libc.so.6
>#2 0x00007436092598fa in abort () from /lib64/libc.so.6
>#3 0x0000000000a00a93 in ExceptionalCondition
>(conditionName=conditionName(at)entry=0xb1e039 "txn->ninvalidations ==
>0", fileName=fileName(at)entry=0xb1dcd8 "reorderbuffer.c",
>lineNumber=lineNumber(at)entry=3207) at assert.c:65
>#4 0x0000000000836c94 in ReorderBufferForget (rb=0x3cf96a40,
>xid=xid(at)entry=1136, lsn=34819568) at reorderbuffer.c:3207
>#5 0x0000000000823aea in DecodeCommit (ctx=ctx(at)entry=0x3cf869d0,
>buf=buf(at)entry=0x7ffd069389a0, parsed=parsed(at)entry=0x7ffd069387d0,
>xid=xid(at)entry=1136, two_phase=false) at decode.c:707
>#6 0x000000000082497c in xact_decode (ctx=ctx(at)entry=0x3cf869d0,
>buf=buf(at)entry=0x7ffd069389a0) at decode.c:237
>#7 0x00000000008246eb in LogicalDecodingProcessRecord
>(ctx=ctx(at)entry=0x3cf869d0, record=0x3cf86da8) at decode.c:116
>#8 0x0000000000829602 in DecodingContextFindStartpoint
>(ctx=ctx(at)entry=0x3cf869d0) at logical.c:647
>#9 0x000000000084e8ea in create_logical_replication_slot
>(name=name(at)entry=0x3ce8fb70 "isolation_slot",
>plugin=plugin(at)entry=0x3ce8fc10 "test_decoding",
>temporary=temporary(at)entry=false, two_phase=two_phase(at)entry=false,
>failover=failover(at)entry=false,
> restart_lsn=restart_lsn(at)entry=0, find_startpoint=true) at slotfuncs.c:177
>#10 0x000000000084f300 in pg_create_logical_replication_slot
>(fcinfo=<optimized out>) at slotfuncs.c:207
>#11 0x00000000006c77c4 in ExecMakeTableFunctionResult
>(setexpr=0x3cf61b18, econtext=0x3cf61968, argContext=<optimized out>,
>expectedDesc=0x3cf79f48, randomAccess=false) at execSRF.c:234
>#12 0x00000000006da839 in FunctionNext (node=node(at)entry=0x3cf61758) at
>nodeFunctionscan.c:94
>
>(gdb) f 4
>#4 0x0000000000836c94 in ReorderBufferForget (rb=0x3cf96a40,
>xid=xid(at)entry=1136, lsn=34819568) at reorderbuffer.c:3207
>3207 Assert(txn->ninvalidations == 0);
>(gdb) p *txn
>$1 = {txn_flags = 1, xid = 1136, toplevel_xid = 0, gid = 0x0,
>first_lsn = 34814328, final_lsn = 34819568, end_lsn = 0, toptxn = 0x0,
>restart_decoding_lsn = 0, origin_id = 0, origin_lsn = 0, {commit_time
>= 0, prepare_time = 0, abort_time = 0}, base_snapshot = 0x0,
> base_snapshot_lsn = 0, base_snapshot_node = {prev = 0x0, next =
>0x0}, snapshot_now = 0x0, command_id = 4294967295, nentries = 14,
>nentries_mem = 14, changes = {head = {prev = 0x3cfb5510, next =
>0x3cfb4ae8}}, tuplecids = {head = {prev = 0x3cfb5440, next =
>0x3cfb4a80}},
> ntuplecids = 13, tuplecid_hash = 0x0, toast_hash = 0x0, subtxns =
>{head = {prev = 0x3cfb2c68, next = 0x3cfb2c68}}, nsubtxns = 0,
>ninvalidations = 22, invalidations = 0x3cf96c50,
>ninvalidations_distributed = 0, invalidations_distributed = 0x0, node
>= {prev = 0x3cfb2b30,
> next = 0x3cf96a48}, catchange_node = {prev = 0x3cf96a68, next =
>0x3cf96a68}, txn_node = {first_child = 0x0, next_sibling = 0x0,
>prev_or_parent = 0x0}, size = 1472, total_size = 1472,
>output_plugin_private = 0x0}
>
>Looks like the assert in ReorderBufferForget failed because
>ninvalidations is not 0.
>
>You need to build with asserts enabled.

Oops, I forgot to add the --enable-cassert option. After adding it,
I was able to reproduce the issue.

I spent some time researching the code. Now, I am wonder about the
relationship between the invalidation message and the base snapshot
of that XID.

1) IIUC, current design requires that an XID with an invalidation message
must have a base snapshot (see the handling in SnapBuildCommitTxn and
ReorderBufferForget regarding base snapshots). This is consistent with
the requirement when adding other regular changes. In that case, when
adding an invalidation message, should we also apply the same checks as
for other regular changes? Specifically, for transactions without enough
information to decode, we would skip adding the invalidation message (see
SnapBuildProcessChange). This should be fine because at that point no
decoding has been done yet, so there is no cache to invalidate. Based on
this idea, I have updated the 0003 and 0004 patches (see attached v7.1-0003
and v7.1-0004).

2) When executing an invalidation message with an XID, do we necessarily
require that the XID has a base snapshot? I did not see a strict relationship
between the two. In other words, when executing an invalidation message, we
could potentially remove the check for xid->base_snapshot. From another perspective,
before executing an invalidation message, the current logic already processes
the base snapshot, and a base snapshot may be not present (see the handling
in SnapBuildCommitTxn with if (needs_snapshot)). Based on this idea, I have
updated the 0003 and 0004 patches (see attached v7.2-0003 and v7.2-0004).

Regards,
Haiyang Li

Attachment Content-Type Size
v7.1-0003-Track-transaction-committed-in-BUILDING_SNAPSHOT.patch application/octet-stream 1.5 KB
v7.1-0004-Track-transaction-committed-in-BUILDING_SNAPSHOT.patch application/octet-stream 4.3 KB
v7.2-0003-Track-transaction-committed-in-BUILDING_SNAPSHOT.patch application/octet-stream 2.5 KB
v7.2-0004-Track-transaction-committed-in-BUILDING_SNAPSHOT.patch application/octet-stream 5.4 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2026-01-29 19:29:27 Re: Buffer locking is special (hints, checksums, AIO writes)
Previous Message Corey Huinker 2026-01-29 19:20:26 Re: Import Statistics in postgres_fdw before resorting to sampling.