Re: test_decoding assertion failure for the loss of top-sub transaction relationship

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "osumi(dot)takamichi(at)fujitsu(dot)com" <osumi(dot)takamichi(at)fujitsu(dot)com>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: test_decoding assertion failure for the loss of top-sub transaction relationship
Date: 2022-09-03 04:43:49
Message-ID: CAA4eK1LK1nxOTL32OP=ejhPoBsUP4Bvwb3Ly=fethyJ-KbaXyw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Sep 2, 2022 at 6:26 AM osumi(dot)takamichi(at)fujitsu(dot)com
<osumi(dot)takamichi(at)fujitsu(dot)com> wrote:
>
>
> I've met an assertion failure of logical decoding with below scenario on HEAD.
>
> ---
> <preparation>
> create table tab1 (val integer);
> select 'init' from pg_create_logical_replication_slot('regression_slot', 'test_decoding');
>
> <session1>
> begin;
> savepoint sp1;
> insert into tab1 values (1);
>
> <session2>
> checkpoint; -- for RUNNING_XACT
> select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
>
> <session1>
> truncate tab1; -- for NEW_CID
> commit;
> begin;
> insert into tab1 values (3);
>

BTW, if I just change the truncate statement to "Analyze tab1" in your
entire test then I am getting a different assertion failure:

postgres.exe!ExceptionalCondition(const char * conditionName, const
char * errorType, const char * fileName, int lineNumber) Line 70 C
postgres.exe!AssertTXNLsnOrder(ReorderBuffer * rb) Line 902 C
postgres.exe!ReorderBufferTXNByXid(ReorderBuffer * rb, unsigned int
xid, bool create, bool * is_new, unsigned __int64 lsn, bool
create_as_top) Line 681 C
postgres.exe!ReorderBufferAddNewTupleCids(ReorderBuffer * rb, unsigned
int xid, unsigned __int64 lsn, RelFileLocator locator, ItemPointerData
tid, unsigned int cmin, unsigned int cmax, unsigned int combocid) Line
3188 C
postgres.exe!SnapBuildProcessNewCid(SnapBuild * builder, unsigned int
xid, unsigned __int64 lsn, xl_heap_new_cid * xlrec) Line 823 C
postgres.exe!heap2_decode(LogicalDecodingContext * ctx,
XLogRecordBuffer * buf) Line 408 C
postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx, XLogReaderState * record) Line 119 C
postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo, bool confirm, bool binary) Line 274 C
postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo) Line 339 C

This is matching with call stack we see intermittently in the BF
[1][2]. The difference with your scenario is that the Truncate
statement generates an additional WAL XLOG_STANDBY_LOCK prior to
XLOG_HEAP2_NEW_CID.

I think we can fix this in the below ways:
a. Assert(prev_first_lsn <= cur_txn->first_lsn); -- Explain in
comments that it is possible when subtransaction and transaction are
not previously logged as it happened in this scenario
b. track txn of prev_first_lsn (say as prev_txn) and check if
prev_txn's toptxn is the same as cur_txn or cur_txn's toptxn is the
same as the prev_txn then perform assert mentioned in (a) else, keep
the current Assert.

It seems (b) will be more robust.

Thoughts?

Note: I added Sawada-San as sometime back we had an offlist discussion
on this intermittent BF failure but we were not able to reach the
exact test which can show this failure.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-08-20%2002%3A45%3A34
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2022-09-03 05:03:43 Re: win_flex.exe (and likely win_bison.exe) isn't concurrency safe
Previous Message Nathan Bossart 2022-09-03 04:30:58 Re: postgres_fdw hint messages