RE: [HACKERS] logical decoding of two-phase transactions

From: "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Smith <smithpb2250(at)gmail(dot)com>
Subject: RE: [HACKERS] logical decoding of two-phase transactions
Date: 2021-05-18 10:50:51
Message-ID: OS0PR01MB61138651E2130F3E9152C45BFB2C9@OS0PR01MB6113.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Ajin

>The above patch had some changes missing which resulted in some tap
>tests failing. Sending an updated patchset. Keeping the patchset
>version the same.

Thanks for your patch. I see a problem about Segmentation fault when using it. Please take a look at this.
The steps to reproduce the problem are as follows.

------publisher------
create table test (a int primary key, b varchar);
create publication pub for table test;

------subscriber------
create table test (a int primary key, b varchar);
create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on);

Then, I prepare, commit, rollback transactions and TRUNCATE table in a sql as follows:
-------------
BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1, 10000) s(i);
PREPARE TRANSACTION 't1';
COMMIT PREPARED 't1';

BEGIN;
INSERT INTO test SELECT i, md5(i::text) FROM generate_series(10001, 20000) s(i);
PREPARE TRANSACTION 't2';
ROLLBACK PREPARED 't2';

TRUNCATE test;
-------------

To make sure the problem produce easily, I looped above operations in my sql file about 10 times, then I can 100% reproduce it and got segmentation fault in publisher log as follows:
-------------
2021-05-18 16:30:56.952 CST [548189] postmaster LOG: server process (PID 548222) was terminated by signal 11: Segmentation fault
2021-05-18 16:30:56.952 CST [548189] postmaster DETAIL: Failed process was running: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '"pub"')
-------------

Here is the core dump information :
-------------
#0 0x000000000090afe4 in pq_sendstring (buf=buf(at)entry=0x251ca80, str=0x0) at pqformat.c:199
#1 0x0000000000ab0a2b in logicalrep_write_begin_prepare (out=0x251ca80, txn=txn(at)entry=0x25346e8) at proto.c:124
#2 0x00007f9528842dd6 in pgoutput_begin_prepare (ctx=ctx(at)entry=0x2514700, txn=txn(at)entry=0x25346e8) at pgoutput.c:495
#3 0x00007f9528843f70 in pgoutput_truncate (ctx=0x2514700, txn=0x25346e8, nrelations=1, relations=0x262f678, change=0x25370b8) at pgoutput.c:905
#4 0x0000000000aa57cb in truncate_cb_wrapper (cache=<optimized out>, txn=<optimized out>, nrelations=<optimized out>, relations=<optimized out>, change=<optimized out>)
at logical.c:1103
#5 0x0000000000abf333 in ReorderBufferApplyTruncate (streaming=false, change=0x25370b8, relations=0x262f678, nrelations=1, txn=0x25346e8, rb=0x2516710)
at reorderbuffer.c:1918
#6 ReorderBufferProcessTXN (rb=rb(at)entry=0x2516710, txn=0x25346e8, commit_lsn=commit_lsn(at)entry=27517176, snapshot_now=<optimized out>, command_id=command_id(at)entry=0,
streaming=streaming(at)entry=false) at reorderbuffer.c:2278
#7 0x0000000000ac0b14 in ReorderBufferReplay (txn=<optimized out>, rb=rb(at)entry=0x2516710, xid=xid(at)entry=738, commit_lsn=commit_lsn(at)entry=27517176,
end_lsn=end_lsn(at)entry=27517544, commit_time=commit_time(at)entry=674644388404356, origin_id=0, origin_lsn=0) at reorderbuffer.c:2591
#8 0x0000000000ac1713 in ReorderBufferCommit (rb=0x2516710, xid=xid(at)entry=738, commit_lsn=27517176, end_lsn=27517544, commit_time=commit_time(at)entry=674644388404356,
origin_id=origin_id(at)entry=0, origin_lsn=0) at reorderbuffer.c:2615
#9 0x0000000000a9f702 in DecodeCommit (ctx=ctx(at)entry=0x2514700, buf=buf(at)entry=0x7ffdd027c2b0, parsed=parsed(at)entry=0x7ffdd027c140, xid=xid(at)entry=738,
two_phase=<optimized out>) at decode.c:742
#10 0x0000000000a9fc6c in DecodeXactOp (ctx=ctx(at)entry=0x2514700, buf=buf(at)entry=0x7ffdd027c2b0) at decode.c:278
#11 0x0000000000aa1b75 in LogicalDecodingProcessRecord (ctx=0x2514700, record=0x2514ac0) at decode.c:142
#12 0x0000000000af6db1 in XLogSendLogical () at walsender.c:2876
#13 0x0000000000afb6aa in WalSndLoop (send_data=send_data(at)entry=0xaf6d49 <XLogSendLogical>) at walsender.c:2306
#14 0x0000000000afbdac in StartLogicalReplication (cmd=cmd(at)entry=0x24da288) at walsender.c:1206
#15 0x0000000000afd646 in exec_replication_command (
cmd_string=cmd_string(at)entry=0x2452570 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '\"pub\"')") at walsender.c:1646
#16 0x0000000000ba3514 in PostgresMain (argc=argc(at)entry=1, argv=argv(at)entry=0x7ffdd027c560, dbname=<optimized out>, username=<optimized out>) at postgres.c:4482
#17 0x0000000000a7284a in BackendRun (port=port(at)entry=0x2477b60) at postmaster.c:4491
#18 0x0000000000a78bba in BackendStartup (port=port(at)entry=0x2477b60) at postmaster.c:4213
#19 0x0000000000a78ff9 in ServerLoop () at postmaster.c:1745
#20 0x0000000000a7bbdf in PostmasterMain (argc=argc(at)entry=3, argv=argv(at)entry=0x244bae0) at postmaster.c:1417
#21 0x000000000090dc80 in main (argc=3, argv=0x244bae0) at main.c:209
-------------

I noticed that it called pgoutput_truncate function and pgoutput_begin_prepare function. It seems odd because TRUNCATE is not in a prepared transaction in my case.

I tried to debug this to learn more and found that in pgoutput_truncate function, the value of in_prepared_txn was true. Later, it got a segmentation fault when it tried to get gid in logicalrep_write_begin_prepare function - it has no gid so we got the segmentation fault.

FYI:
I also tested the case in synchronous mode, and it can execute successfully. So, I think the value of in_prepared_txn is sometimes incorrect in asynchronous mode. Maybe there's a better way to get this.

Regards
Tang

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Drouvot, Bertrand 2021-05-18 11:04:18 Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Previous Message Masahiko Sawada 2021-05-18 09:28:49 Re: PG 14 release notes, first draft