Re: Synchronizing slots from primary to standby

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-01-31 15:49:37
Message-ID: CAD21AoAv6FwZ6UPNTj6=7A+3O2m4utzfL8ZGS6X1EGexikG66A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Thank you for updating the patches. As for the slotsync worker patch,
> > is there any reason why 0001, 0002, and 0004 patches are still
> > separated?
> >
>
> No specific reason, it could be easier to review those parts.

Okay, I think we can merge 0001 and 0002 at least as we don't need
bgworker codes.

>
> >
> > Beside, here are some comments on v74 0001, 0002, and 0004 patches:
> >
> > ---
> > +static char *
> > +wait_for_valid_params_and_get_dbname(void)
> > +{
> > + char *dbname;
> > + int rc;
> > +
> > + /* Sanity check. */
> > + Assert(enable_syncslot);
> > +
> > + for (;;)
> > + {
> > + if (validate_parameters_and_get_dbname(&dbname))
> > + break;
> > + ereport(LOG, errmsg("skipping slot synchronization"));
> > +
> > + ProcessSlotSyncInterrupts(NULL);
> >
> > When reading this function, I expected that the slotsync worker would
> > resume working once the parameters became valid, but it was not
> > correct. For example, if I changed hot_standby_feedback from off to
> > on, the slotsync worker reads the config file, exits, and then
> > restarts. Given that the slotsync worker ends up exiting on parameter
> > changes anyway, why do we want to have it wait for parameters to
> > become valid?
> >
>
> Right, the reason for waiting is to avoid repeated re-start of
> slotsync worker if the required parameter is not changed. To follow
> that, I think we should simply continue when the required parameter is
> changed and is valid. But, I think during actual slotsync, if
> connection_info is changed then there is no option but to restart.

Agreed.

> >
> > ---
> > +bool
> > +SlotSyncWorkerCanRestart(void)
> > +{
> > +#define SLOTSYNC_RESTART_INTERVAL_SEC 10
> > +
> >
> > IIUC depending on how busy the postmaster is and the timing, the user
> > could wait for 1 min to re-launch the slotsync worker. But I think the
> > user might want to re-launch the slotsync worker more quickly for
> > example when the slotsync worker restarts due to parameter changes.
> > IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the
> > slotsync worker previously exited with 0 or 1.
> >
>
> Considering my previous where we don't want to restart for a required
> parameter change, isn't it better to avoid repeated restart (say when
> the user gave an invalid dbname)? BTW, I think this restart interval
> is added based on your previous complaint [1].

I think it's useful that the slotsync worker restarts immediately when
a required parameter is changed but waits to restart when it exits
with an error. IIUC the apply worker does so; if it restarts due to a
subscription parameter change, it resets the last-start time so that
the launcher will restart it without waiting. But if it exits with an
error, the launcher waits for wal_retrieve_retry_interval. I don't
think the slotsync worker must follow this behavior but I feel it's
useful behavior.

>
> >
> > ---
> > When I dropped a database on the primary that has a failover slot, I
> > got the following logs on the standby:
> >
> > 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is
> > active for PID 1103935
> > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20
> > for Database/DROP: dir 1663/16384
> > 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID
> > 1103933) exited with exit code 1
> >
> > It seems that because the slotsync worker created the slot on the
> > standby, the slot's active_pid is still valid.
> >
>
> But we release the slot after sync. And we do take a shared lock on
> the database to make the startup process wait for slotsync. There is
> one gap which is that we don't reset active_pid for temp slots in
> ReplicationSlotRelease(), so for temp slots such an error can occur
> but OTOH, we immediately make the slot persistent after sync. As per
> my understanding, it is only possible to get this error if the initial
> sync doesn't happen and the slot remains temporary. Is that your case?
> How did reproduce this?

I created a failover slot manually on the primary and dropped the
database where the failover slot is created. So this would not happen
in normal cases.

BTW I've tested the following switch/fail-back scenario but it seems
not to work fine. Am I missing something?

Setup:
node1 is the primary, node2 is the physical standby for node1, and
node3 is the subscriber connecting to node1.

Steps:
1. [node1]: create a table and a publication for the table.
2. [node2]: set enable_syncslot = on and start (to receive WALs from node1).
3. [node3]: create a subscription with failover = true for the publication.
4. [node2]: promote to the new standby.
5. [node3]: alter subscription to connect the new primary, node2.
6. [node1]: stop, set enable_syncslot = on (and other required
parameters), then start as a new standby.

Then I got the error "exiting from slot synchronization because same
name slot "test_sub" already exists on the standby".

The logical replication slot that was created on the old primary
(node1) has been synchronized to the old standby (node2). Therefore on
node2, the slot's "synced" field is true. However, once node1 starts
as the new standby with slot synchronization, the slotsync worker
cannot synchronize the slot because the slot's "synced" field on the
primary is false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Eli Schwartz 2024-01-31 15:50:44 Re: make dist using git archive
Previous Message Andrew Dunstan 2024-01-31 15:41:57 Re: [PATCH] Add native windows on arm64 support