RE: [HACKERS] logical decoding of two-phase transactions

From: "tanghy(dot)fnst(at)fujitsu(dot)com" <tanghy(dot)fnst(at)fujitsu(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: [HACKERS] logical decoding of two-phase transactions
Date: 2021-05-25 06:41:25
Message-ID: OS0PR01MB611332752AFF6ADF580AB849FB259@OS0PR01MB6113.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > 13.
> > @@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt,
> > bool isTopLevel)
> > {
> > Assert(slotname);
> >
> > - walrcv_create_slot(wrconn, slotname, false,
> > + /*
> > + * Even if two_phase is set, don't create the slot with
> > + * two-phase enabled. Will enable it once all the tables are
> > + * synced and ready. This avoids race-conditions like prepared
> > + * transactions being skipped due to changes not being applied
> > + * due to checks in should_apply_changes_for_rel() when
> > + * tablesync for the corresponding tables are in progress. See
> > + * comments atop worker.c.
> > + */
> > + walrcv_create_slot(wrconn, slotname, false, false,
> >
> > Can't we enable two_phase if copy_data is false? Because in that case,
> > all relations will be in a READY state. If we do that then we should
> > also set two_phase state as 'enabled' during createsubscription. I
> > think we need to be careful to check that connect option is given and
> > copy_data is false before setting such a state. Now, I guess we may
> > not be able to optimize this to not set 'enabled' state when the
> > subscription has no rels.
> >
>
> Fixed in v77-0001

I noticed this modification in v77-0001 and executed "CREATE SUBSCRIPTION ... WITH (two_phase = on, copy_data = false)", but it crashed.
-------------
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false);
WARNING: relcache reference leak: relation "pg_subscription" not closed
WARNING: snapshot 0x34278d0 still active
NOTICE: created replication slot "sub" on publisher
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
-------------

There are two warnings and a segmentation fault in subscriber log:
-------------
2021-05-24 15:08:32.435 CST [2848572] WARNING: relcache reference leak: relation "pg_subscription" not closed
2021-05-24 15:08:32.435 CST [2848572] WARNING: snapshot 0x32ce8b0 still active
2021-05-24 15:08:33.012 CST [2848555] LOG: server process (PID 2848572) was terminated by signal 11: Segmentation fault
2021-05-24 15:08:33.012 CST [2848555] DETAIL: Failed process was running: CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false);
-------------

The backtrace about segmentation fault is attached. It happened in table_close function, we got it because "CurrentResourceOwner" was NULL.

I think it was related with the first warning, which reported "relcache reference leak". The backtrace information is attached, too. When updating two-phase state in CreateSubscription function, it released resource owner and set "CurrentResourceOwner" as NULL in CommitTransaction function.

The second warning about "snapshot still active" was also happened in CommitTransaction function. It called AtEOXact_Snapshot function, checked leftover snapshots and reported the warning.
I debugged and found the snapshot was added in function PortalRunUtility by calling PushActiveSnapshot function, the address of "ActiveSnapshot" at this time was same as the address in warning.

In summary, when creating subscription with two_phase = on and copy_data = false, it calls UpdateTwoPhaseState function in CreateSubscription function to set two_phase state as 'enabled', and it checked and released relcache and snapshot too early so the NG happened. I think some change should be made to avoid it. Thought?

FYI
I also tested the new released V78* at [1], the above NG still exists.
[1] https://www.postgresql.org/message-id/CAFPTHDab56twVmC%2B0a%3DRNcRw4KuyFdqzW0JAcvJdS63n_fRnOQ%40mail.gmail.com

Regards
Tang

Attachment Content-Type Size
backtrace_segmentation_fault.txt text/plain 3.3 KB
backtrace_first_warning.txt text/plain 3.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-05-25 06:42:46 Re: Assertion failure while streaming toasted data
Previous Message Guillaume Lelarge 2021-05-25 06:40:58 Re: Issue on catalogs.sgml