RE: Excessive number of replication slots for 12->14 logical replication

From: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Hubert Lubaczewski <depesz(at)depesz(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: Excessive number of replication slots for 12->14 logical replication
Date: 2022-09-10 03:41:01
Message-ID: OS0PR01MB5716E128E78C6CECD15C718394429@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Saturday, September 10, 2022 5:49 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Aug 30, 2022 at 3:44 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Aug 26, 2022 at 7:04 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> > >
> > > Thanks for the testing. I'll push this sometime early next week (by
> > > Tuesday) unless Sawada-San or someone else has any comments on it.
> > >
> >
> > Pushed.
>
> Tom reported buildfarm failures[1] and I've investigated the cause and
> concluded this commit is relevant.
>
> In process_syncing_tables_for_sync(), we have the following code:
>
> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
> MyLogicalRepWorker->relid,
> MyLogicalRepWorker->relstate,
> MyLogicalRepWorker->relstate_lsn);
>
> ReplicationOriginNameForTablesync(MyLogicalRepWorker->subid,
> MyLogicalRepWorker->relid,
> originname,
> sizeof(originname));
> replorigin_session_reset();
> replorigin_session_origin = InvalidRepOriginId;
> replorigin_session_origin_lsn = InvalidXLogRecPtr;
> replorigin_session_origin_timestamp = 0;
>
> /*
> * We expect that origin must be present. The concurrent operations
> * that remove origin like a refresh for the subscription take an
> * access exclusive lock on pg_subscription which prevent the previou
> * operation to update the rel state to SUBREL_STATE_SYNCDONE to
> * succeed.
> */
> replorigin_drop_by_name(originname, false, false);
>
> /*
> * End streaming so that LogRepWorkerWalRcvConn can be used to
> drop
> * the slot.
> */
> walrcv_endstreaming(LogRepWorkerWalRcvConn, &tli);
>
> /*
> * Cleanup the tablesync slot.
> *
> * This has to be done after the data changes because otherwise if
> * there is an error while doing the database operations we won't be
> * able to rollback dropped slot.
> */
> ReplicationSlotNameForTablesync(MyLogicalRepWorker->subid,
> MyLogicalRepWorker->relid,
> syncslotname,
> sizeof(syncslotname));
>
> If the table sync worker errored at walrcv_endstreaming(), we assumed that both
> dropping the replication origin and updating relstate are rolled back, which
> however was wrong. Indeed, the replication origin is not dropped but the
> in-memory state is reset. Therefore, after the tablesync worker restarts, it starts
> logical replication with starting point 0/0. Consequently, it ends up applying
> the transaction that has already been applied.

Thanks for the analysis !

I think you are right. The replorigin_drop_by_name() will clear the
remote_lsn/local_lsn in shared memory which won't be rollback if we fail to
drop the origin.

Per off-list discussion with Amit. To fix this problem, I think we need to drop
the origin in table sync worker after committing the transaction which set the
relstate to SYNCDONE. Because it can make sure that the worker won’t be
restarted even if we fail to drop the origin. Besides, we need to add the
origin drop code back in apply worker in case the table sync worker failed to
drop the origin before it exits(which seems a rare case). I will share the
patch if we agree with the fix.

Best regards,
Hou zj

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2022-09-10 05:36:44 Re: Excessive number of replication slots for 12->14 logical replication
Previous Message Masahiko Sawada 2022-09-09 21:48:54 Re: Excessive number of replication slots for 12->14 logical replication