Assertion failure in SnapBuildInitialSnapshot()

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Assertion failure in SnapBuildInitialSnapshot()
Date: 2022-11-10 10:34:40
Message-ID: CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Thomas has reported this failure in an email [1] and shared the
following links offlist with me:
https://cirrus-ci.com/task/5311549010083840
https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/testrun/build/testrun/subscription/100_bugs/log/100_bugs_twoways.log
https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt

The call stack is as follows:
00000063`4edff670 00007ff7`1922fcdf postgres!ExceptionalCondition(
char * conditionName = 0x00007ff7`198f8050
"TransactionIdPrecedesOrEquals(safeXid, snap->xmin)",
char * fileName = 0x00007ff7`198f8020
"../src/backend/replication/logical/snapbuild.c",
int lineNumber = 0n600)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
00000063`4edff6b0 00007ff7`192106df postgres!SnapBuildInitialSnapshot(
struct SnapBuild * builder = 0x00000251`5b95bce8)+0x20f
[c:\cirrus\src\backend\replication\logical\snapbuild.c @ 600]
00000063`4edff730 00007ff7`1920d9f6 postgres!CreateReplicationSlot(
struct CreateReplicationSlotCmd * cmd = 0x00000251`5b94d828)+0x40f
[c:\cirrus\src\backend\replication\walsender.c @ 1152]
00000063`4edff870 00007ff7`192bc9c4 postgres!exec_replication_command(
char * cmd_string = 0x00000251`5b94ac68 "CREATE_REPLICATION_SLOT
"pg_16400_sync_16392_7163433409941550636" LOGICAL pgoutput (SNAPSHOT
'use')")+0x4a6 [c:\cirrus\src\backend\replication\walsender.c @ 1804]

As per my investigation based on the above logs, the failed test is
due to the following command in 100_bugs.pl:
$node_twoways->safe_psql('d2',
"CREATE SUBSCRIPTION testsub CONNECTION \$\$"
. $node_twoways->connstr('d1')
. "\$\$ PUBLICATION testpub WITH (create_slot=false, "
. "slot_name='testslot')");

It failed while creating the table sync slot.

The failure happens because the xmin computed by the snap builder is
lesser than what is computed by GetOldestSafeDecodingTransactionId()
during initial snapshot creation for the tablesync slot by
SnapBuildInitialSnapshot.

To investigate, I tried to study how the values of "safeXid" and
"snap->xmin" are computed in SnapBuildInitialSnapshot(). There appear
to be four places in the code where we assign value to xmin
(builder-xmin) during the snapshot building process and then we assign
the same to snap->xmin. Those places are: (a) Two places in
SnapBuildFindSnapshot(), (b) One place in SnapBuildRestore(), and (c)
One place in SnapBuildProcessRunningXacts()

Seeing the LOGS, it appears to me that we find a consistent point from
the below code in SnapBuildFindSnapshot() and the following line
assigns builder->xmin.

...
if (running->oldestRunningXid == running->nextXid)
{
...
builder->xmin = running->nextXid;

The reason is we only see "logical decoding found consistent point at
..." in LOGs. If SnapBuildFindSnapshot() has to go through the entire
machinery of snapshot building then, we should have seen "logical
decoding found initial starting point at ..." and similar other LOGs.
The builder->xmin can't be assigned from any other place in (b) or
(c). From (c), it can't be assigned because we are building a full
snapshot here which won't restore any serialized snapshot. It can't be
assigned from (b) because while creating a slot we stop as soon as we
find the consistent point, see
DecodingContextFindStartpoint()->DecodingContextReady()

In the above code snippet, the running->nextXid in the above code
snippet has been assigned from ShmemVariableCache->nextXid in
GetRunningTransactionData().

The safeXid computed from GetOldestSafeDecodingTransactionId() uses
ShmemVariableCache->nextXid as the starting point and keeps the slot's
xmin as the safe Xid limit.

It seems to me due to SnapBuilder's initial_xmin_horizon, we won't set
(SnapBuilder's) xmin lesser than slot's effective_xmin computed in
CreateInitDecodingContext(). See SnapBuildFindSnapshot(). So, ideally,
SnapBuildInitialSnapshot() should never compute safeXid which is based
on the minimum of all slot's effective_xmin to be greater than
SnapBuilder's xmin (or snapshot's xmin). In short, the code as written
seems correct to me.

I have also tried to analyze if any recent commit (7f13ac8123) could
cause this problem but can't think of any reason because the changes
are related to the restart of decoding and the failed test is related
to creating a slot the very first time.

I don't have any good ideas on how to proceed with this. Any thoughts
on this would be helpful?

Note: I have briefly discussed this issue with Sawada-San and
Kuroda-San, so kept them in Cc.

[1] - https://www.postgresql.org/message-id/CA%2BhUKG%2BA_LyW%3DFAOi2ebA9Vr-1%3Desu%2BeBSm0dsVhU%3DEgfpipkg%40mail.gmail.com

--
With Regards,
Amit Kapila.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message sirisha chamarthi 2022-11-10 10:37:52 Re: Reviving lost replication slots
Previous Message Bharath Rupireddy 2022-11-10 10:26:35 Re: Printing backtrace of postgres processes