Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-01-31 10:42:25
Message-ID: CAA4eK1LwP-h6XSYg=Dgqos9k-t2vrwWJwXRbCk2QkeHSpoYEjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Thank you for updating the patches. As for the slotsync worker patch,
> is there any reason why 0001, 0002, and 0004 patches are still
> separated?
>

No specific reason, it could be easier to review those parts.

>
> Beside, here are some comments on v74 0001, 0002, and 0004 patches:
>
> ---
> +static char *
> +wait_for_valid_params_and_get_dbname(void)
> +{
> + char *dbname;
> + int rc;
> +
> + /* Sanity check. */
> + Assert(enable_syncslot);
> +
> + for (;;)
> + {
> + if (validate_parameters_and_get_dbname(&dbname))
> + break;
> + ereport(LOG, errmsg("skipping slot synchronization"));
> +
> + ProcessSlotSyncInterrupts(NULL);
>
> When reading this function, I expected that the slotsync worker would
> resume working once the parameters became valid, but it was not
> correct. For example, if I changed hot_standby_feedback from off to
> on, the slotsync worker reads the config file, exits, and then
> restarts. Given that the slotsync worker ends up exiting on parameter
> changes anyway, why do we want to have it wait for parameters to
> become valid?
>

Right, the reason for waiting is to avoid repeated re-start of
slotsync worker if the required parameter is not changed. To follow
that, I think we should simply continue when the required parameter is
changed and is valid. But, I think during actual slotsync, if
connection_info is changed then there is no option but to restart.

>
> ---
> +bool
> +SlotSyncWorkerCanRestart(void)
> +{
> +#define SLOTSYNC_RESTART_INTERVAL_SEC 10
> +
>
> IIUC depending on how busy the postmaster is and the timing, the user
> could wait for 1 min to re-launch the slotsync worker. But I think the
> user might want to re-launch the slotsync worker more quickly for
> example when the slotsync worker restarts due to parameter changes.
> IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the
> slotsync worker previously exited with 0 or 1.
>

Considering my previous where we don't want to restart for a required
parameter change, isn't it better to avoid repeated restart (say when
the user gave an invalid dbname)? BTW, I think this restart interval
is added based on your previous complaint [1].

>
> ---
> When I dropped a database on the primary that has a failover slot, I
> got the following logs on the standby:
>
> 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is
> active for PID 1103935
> 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20
> for Database/DROP: dir 1663/16384
> 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID
> 1103933) exited with exit code 1
>
> It seems that because the slotsync worker created the slot on the
> standby, the slot's active_pid is still valid.
>

But we release the slot after sync. And we do take a shared lock on
the database to make the startup process wait for slotsync. There is
one gap which is that we don't reset active_pid for temp slots in
ReplicationSlotRelease(), so for temp slots such an error can occur
but OTOH, we immediately make the slot persistent after sync. As per
my understanding, it is only possible to get this error if the initial
sync doesn't happen and the slot remains temporary. Is that your case?
How did reproduce this?

That is why the startup
> process could not drop the slot.
>

[1] - https://www.postgresql.org/message-id/CAD21AoApGoTZu7D_7%3DbVYQqKnj%2BPZ2Rz%2Bnc8Ky1HPQMS_XL6%2BA%40mail.gmail.com

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nisha Moond 2024-01-31 10:58:28 Re: Improve the connection failure error messages
Previous Message Heikki Linnakangas 2024-01-31 10:37:22 Re: Extending SMgrRelation lifetimes