RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication

From: "Yu Shi (Fujitsu)" <shiy(dot)fnst(at)fujitsu(dot)com>
To: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, "Wei Wang (Fujitsu)" <wangw(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: RE: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Date: 2023-06-05 11:06:45
Message-ID: OSZPR01MB6310133C87996F1182F7920FFD4DA@OSZPR01MB6310.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 1, 2023 6:54 PM Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com> wrote:
>
> Hi,
>
> I rebased the patch and addressed the following reviews.
>

Thanks for updating the patch. Here are some comments on 0001 patch.

1.
- ereport(LOG,
- (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has finished",
- MySubscription->name,
- get_rel_name(MyLogicalRepWorker->relid))));

Could we move this to somewhere else instead of removing it?

2.
+ if (!OidIsValid(originid))
+ originid = replorigin_create(originname);
+ replorigin_session_setup(originid, 0);
+ replorigin_session_origin = originid;
+ *origin_startpos = replorigin_session_get_progress(false);
+ CommitTransactionCommand();
+
+ /* Is the use of a password mandatory? */
+ must_use_password = MySubscription->passwordrequired &&
+ !superuser_arg(MySubscription->owner);
+ LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true,
+ must_use_password,
+ MySubscription->name, &err);

It seems that there is a problem when refactoring.
See commit e7e7da2f8d5.

3.
+ /* Set this to false for safety, in case we're already reusing the worker */
+ MyLogicalRepWorker->ready_to_reuse = false;
+

I am not sure do we need to lock when setting it.

4.
+ /*
+ * Allocate the origin name in long-lived context for error context
+ * message.
+ */
+ StartTransactionCommand();
+ ReplicationOriginNameForLogicalRep(MySubscription->oid,
+ MyLogicalRepWorker->relid,
+ originname,
+ originname_size);
+ CommitTransactionCommand();

Do we need the call to StartTransactionCommand() and CommitTransactionCommand()
here? Besides, the comment here is the same as the comment atop
set_apply_error_context_origin(), do we need it?

5.
I saw a segmentation fault when debugging.

It happened when calling sync_worker_exit() called (see the code below in
LogicalRepSyncTableStart()). In the case that this is not the first table the
worker synchronizes, clean_sync_worker() has been called before (in
TablesyncWorkerMain()), and LogRepWorkerWalRcvConn has been set to NULL. Then, a
segmentation fault happened because LogRepWorkerWalRcvConn is a null pointer.

switch (relstate)
{
case SUBREL_STATE_SYNCDONE:
case SUBREL_STATE_READY:
case SUBREL_STATE_UNKNOWN:
sync_worker_exit(); /* doesn't return */
}

Here is the backtrace.

#0 0x00007fc8a8ce4c95 in libpqrcv_disconnect (conn=0x0) at libpqwalreceiver.c:757
#1 0x000000000092b8c0 in clean_sync_worker () at tablesync.c:150
#2 0x000000000092b8ed in sync_worker_exit () at tablesync.c:164
#3 0x000000000092d8f6 in LogicalRepSyncTableStart (origin_startpos=0x7ffd50f30f08) at tablesync.c:1293
#4 0x0000000000934f76 in start_table_sync (origin_startpos=0x7ffd50f30f08, myslotname=0x7ffd50f30e80) at worker.c:4457
#5 0x000000000093513b in run_tablesync_worker (options=0x7ffd50f30ec0, slotname=0x0, originname=0x7ffd50f30f10 "pg_16394_16395",
originname_size=64, origin_startpos=0x7ffd50f30f08) at worker.c:4532
#6 0x0000000000935a3a in TablesyncWorkerMain (main_arg=1) at worker.c:4853
#7 0x00000000008e97f6 in StartBackgroundWorker () at bgworker.c:864
#8 0x00000000008f350b in do_start_bgworker (rw=0x12fc1a0) at postmaster.c:5762
#9 0x00000000008f38b7 in maybe_start_bgworkers () at postmaster.c:5986
#10 0x00000000008f2975 in process_pm_pmsignal () at postmaster.c:5149
#11 0x00000000008ee98a in ServerLoop () at postmaster.c:1770
#12 0x00000000008ee3bb in PostmasterMain (argc=3, argv=0x12c4af0) at postmaster.c:1463
#13 0x00000000007b6d3a in main (argc=3, argv=0x12c4af0) at main.c:198

The steps to reproduce:
Worker1, in TablesyncWorkerMain(), the relstate of new table to sync (obtained
by GetSubscriptionRelations) is SUBREL_STATE_INIT, and in the foreach loop,
before the following Check (it needs a breakpoint before locking),

LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
if (rstate->state != SUBREL_STATE_SYNCDONE &&
!logicalrep_worker_find(MySubscription->oid, rstate->relid, false))
{
/* Update worker state for the next table */
MyLogicalRepWorker->relid = rstate->relid;
MyLogicalRepWorker->relstate = rstate->state;
MyLogicalRepWorker->relstate_lsn = rstate->lsn;
LWLockRelease(LogicalRepWorkerLock);
break;
}
LWLockRelease(LogicalRepWorkerLock);

let this table to be synchronized by another table sync worker (Worker2), and
Worker2 has finished before logicalrep_worker_find was called(). Then Worker1
tried to sync a table whose state is SUBREL_STATE_READY and the segmentation
fault happened.

Regards,
Shi Yu

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2023-06-05 11:24:13 Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)
Previous Message Ranier Vilela 2023-06-05 11:06:26 Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)