Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-11-10 03:31:07
Message-ID: CAJpy0uA6-oz=3daNYt8LwJi3bnjnhBLV3vkza52P_5rm-SzZRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On 11/9/23 11:54 AM, shveta malik wrote:
> >
> > PFA v32 patches which has below changes:
>
> Thanks!
>
> > 7) Added warning for cases where a user-slot with the same name is
> > already present which slot-sync worker is trying to create. Sync for
> > such slots is skipped.
>
> I'm seeing assertion and segfault in this case due to ReplicationSlotRelease()
> in synchronize_one_slot().
>
> Adding this extra check prior to it:
>
> - ReplicationSlotRelease();
> + if (!(found && s->data.sync_state == SYNCSLOT_STATE_NONE))
> + ReplicationSlotRelease();
>
> make them disappear.
>

Oh, I see. Thanks for pointing it out. I will fix it in the next version.

> >
> > Open Question:
> > 1) Currently I have put drop slot logic for slots with 'sync_state=i'
> > in slot-sync worker. Do we need to put it somewhere in promotion-logic
> > as well?
>
> Yeah I think so, because there is a time window when one could "use" the slot
> after the promotion and before it is removed. Producing things like:
>
> "
> 2023-11-09 15:16:50.294 UTC [2580462] LOG: dropped replication slot "logical_slot2" of dbid 5 as it was not sync-ready
> 2023-11-09 15:16:50.295 UTC [2580462] LOG: dropped replication slot "logical_slot3" of dbid 5 as it was not sync-ready
> 2023-11-09 15:16:50.297 UTC [2580462] LOG: dropped replication slot "logical_slot4" of dbid 5 as it was not sync-ready
> 2023-11-09 15:16:50.297 UTC [2580462] ERROR: replication slot "logical_slot5" is active for PID 2594628
> "
>
> After the promotion one was able to use logical_slot5 and now we can now drop it.

Yes, I was suspicious about this small window which may allow others
to use this slot, that is why I was thinking of putting it in the
promotion flow and thus asked that question earlier. But the slot-sync
worker may end up creating it again in case it has not exited. So we
need to carefully decide at what all places we need to put 'not-in
recovery' checks in slot-sync workers. In the previous version,
synchronize_one_slot() had that check and it was skipping sync if
'!RecoveryInProgress'. But I have removed that check in v32 thinking
that the slots which the worker has already fetched from the primary,
let them all get synced and exit after that nstead of syncing half
and leaving rest. But now on rethinking, was the previous behaviour
correct i.e. skip sync at that point onward where we see it is no
longer in standby-mode while few of the slots have already been synced
in that sync-cycle. Thoughts?

>
> > Perhaps in WaitForWALToBecomeAvailable() where we call
> > XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts?
> >
>
> You mean here?
>
> /*
> * Check to see if promotion is requested. Note that we do
> * this only after failure, so when you promote, we still
> * finish replaying as much as we can from archive and
> * pg_wal before failover.
> */
> if (StandbyMode && CheckForStandbyTrigger())
> {
> XLogShutdownWalRcv();
> return XLREAD_FAIL;
> }
>

yes, here.

> If so, that sounds like a good place to me.

okay. Will add it.

>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2023-11-10 03:31:27 Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Previous Message Andres Freund 2023-11-10 03:06:33 Re: Failure during Building Postgres in Windows with Meson