Re: Synchronizing slots from primary to standby

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-01-31 08:32:00
Message-ID: CAD21AoDUfnnxP+y2cg=LhP-bQXqFE1z4US-no=u30J7X=4Z6Aw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 30, 2024 at 9:53 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Tue, Jan 30, 2024 at 4:06 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > PFA v73-0001 which addresses the above comments. Other patches will be
> > rebased and posted after pushing this one.
>
> Since v73-0001 is pushed, PFA rest of the patches. Changes are:
>
> 1) Rebased the patches.
> 2) Ran pg_indent on all.
> 3) patch001: Updated logicaldecoding.sgml for dbname requirement in
> primary_conninfo for slot-synchronization.
>

Thank you for updating the patches. As for the slotsync worker patch,
is there any reason why 0001, 0002, and 0004 patches are still
separated?

Beside, here are some comments on v74 0001, 0002, and 0004 patches:

---
+static char *
+wait_for_valid_params_and_get_dbname(void)
+{
+ char *dbname;
+ int rc;
+
+ /* Sanity check. */
+ Assert(enable_syncslot);
+
+ for (;;)
+ {
+ if (validate_parameters_and_get_dbname(&dbname))
+ break;
+ ereport(LOG, errmsg("skipping slot synchronization"));
+
+ ProcessSlotSyncInterrupts(NULL);

When reading this function, I expected that the slotsync worker would
resume working once the parameters became valid, but it was not
correct. For example, if I changed hot_standby_feedback from off to
on, the slotsync worker reads the config file, exits, and then
restarts. Given that the slotsync worker ends up exiting on parameter
changes anyway, why do we want to have it wait for parameters to
become valid? IIUC even if the slotsync worker exits when a parameter
is not valid, it restarts at some intervals.

---
+bool
+SlotSyncWorkerCanRestart(void)
+{
+#define SLOTSYNC_RESTART_INTERVAL_SEC 10
+

IIUC depending on how busy the postmaster is and the timing, the user
could wait for 1 min to re-launch the slotsync worker. But I think the
user might want to re-launch the slotsync worker more quickly for
example when the slotsync worker restarts due to parameter changes.
IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the
slotsync worker previously exited with 0 or 1.

---
+ /* We are a normal standby */
+ valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
+ Assert(!isnull);

What do you mean by "normal standby"?

---
+ appendStringInfo(&cmd,
+ "SELECT pg_is_in_recovery(), count(*) = 1"
+ " FROM pg_replication_slots"
+ " WHERE slot_type='physical' AND slot_name=%s",
+ quote_literal_cstr(PrimarySlotName));

I think we need to make "pg_replication_slots" schema-qualified.

---
+ errdetail("The primary server slot \"%s\" specified by"
+ " \"%s\" is not valid.",
+ PrimarySlotName, "primary_slot_name"));

and

+ errmsg("slot sync worker will shutdown because"
+ " %s is disabled", "enable_syncslot"));

It's better to write it in one line for better greppability.

---
When I dropped a database on the primary that has a failover slot, I
got the following logs on the standby:

2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is
active for PID 1103935
2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20
for Database/DROP: dir 1663/16384
2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID
1103933) exited with exit code 1

It seems that because the slotsync worker created the slot on the
standby, the slot's active_pid is still valid. That is why the startup
process could not drop the slot.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nazir Bilal Yavuz 2024-01-31 08:45:43 Re: 003_extrafiles.pl test fails on Windows with the newer Perl versions
Previous Message vignesh C 2024-01-31 08:31:39 Re: Improve eviction algorithm in ReorderBuffer