Re: Synchronizing slots from primary to standby

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-10-25 02:02:04
Message-ID: CAJpy0uCexeM-9dVe4eWgjj7G5MuNqc6CFRcrHu8fjmPmtbSZ-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> Hi,
>
> On 10/24/23 7:44 AM, Ajin Cherian wrote:
> > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
> > <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >>
> >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> >> SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
> >> }
> >>
> >> + /* set failover in the slot, as requested */
> >> + slot->data.failover = ctx->failover;
> >> +
> >>
> >> I think we can get rid of this change in CreateDecodingContext().
> >>
> > Yes, I too noticed this in my testing, however just removing this from
> > CreateDecodingContext will not allow us to change the slot's failover flag
> > using Alter subscription.
>
> Oh right.
>
> > I am thinking of moving this change to
> > StartLogicalReplication prior to calling CreateDecodingContext by
> > parsing the command options in StartReplicationCmd
> > without adding it to the LogicalDecodingContext.
> >
>
> Yeah, that looks like a good place to update "failover".
>
> Doing more testing and I have a couple of remarks about he current behavior.
>
> 1) Let's imagine that:
>
> - there is no standby
> - standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby)
> - then a create subscription on a subscriber WITH (failover = true) would start the
> synchronisation but never finish (means leaving a "synchronisation" slot like "pg_32811_sync_24576_7293415241672430356"
> in place coming from ReplicationSlotNameForTablesync()).
>
> That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when there is
> a slot part of standby_slot_names which is not active/does not have an active_pid attached to it?
>

Agreed, Will do that.

> 2) When we create a subscription, another slot is created during the subscription synchronization, namely
> like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()).
>
> This extra slot appears to have failover also set to true.
>
> So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like
> on the standby:
>
> LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and and catalog xmin (756)
> LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)
> LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog xmin (756)
> WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation
>
> I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover
> during the subscription creation, better to re-launch the subscription instead?
>

'Failover' property of subscription is carried to the tablesync-slot
in recent versions only with the intent that if failover happens
during create-sub during table-sync of large tables, then users should
be able to start from that point onward on the new primary. But yes,
the above scenario is highly probable where-in no activity is
happening on primary and thus the table-sync slot is waiting for its
creation during sync on standby. So I agree, to simplify the stuff we
can skip table-sync slot syncing on standby and document this
behaviour.

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2023-10-25 02:03:34 Re: post-recovery amcheck expectations
Previous Message Nathan Bossart 2023-10-25 01:54:01 Re: Improve WALRead() to suck data directly from WAL buffers when possible