Re: Synchronizing slots from primary to standby

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-20 12:49:08
Message-ID: CAD21AoB2ipSzQb5-o5pEYKie4oTPJTsYR1ip9_wRVrF6HbBWDQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Feb 20, 2024 at 12:33 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> >
> > I've reviewed the v91 patch. Here are random comments:
>
> Thanks for the comments.
>
> > ---
> > /*
> > * Checks the remote server info.
> > *
> > - * We ensure that the 'primary_slot_name' exists on the remote server and the
> > - * remote server is not a standby node.
> > + * Check whether we are a cascading standby. For non-cascading standbys, it
> > + * also ensures that the 'primary_slot_name' exists on the remote server.
> > */
> >
> > IIUC what the validate_remote_info() does doesn't not change by this
> > patch, so the previous comment seems to be clearer to me.
> >
> > ---
> > if (remote_in_recovery)
> > + {
> > + /*
> > + * If we are a cascading standby, no need to check further for
> > + * 'primary_slot_name'.
> > + */
> > ereport(ERROR,
> > errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > errmsg("cannot synchronize replication slots from a
> > standby server"));
> > + }
> > + else
> > + {
> > + bool primary_slot_valid;
> >
> > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
> > - Assert(!isnull);
> > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
> > + Assert(!isnull);
> >
> > - if (!primary_slot_valid)
> > - ereport(ERROR,
> > - errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > - errmsg("bad configuration for slot synchronization"),
> > - /* translator: second %s is a GUC variable name */
> > - errdetail("The replication slot \"%s\" specified by
> > \"%s\" does not exist on the primary server.",
> > - PrimarySlotName, "primary_slot_name"));
> > + if (!primary_slot_valid)
> > + ereport(ERROR,
> > + errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("bad configuration for slot synchronization"),
> > + /* translator: second %s is a GUC variable name */
> > + errdetail("The replication slot \"%s\" specified
> > by \"%s\" does not exist on the primary server.",
> > + PrimarySlotName, "primary_slot_name"));
> > + }
> >
> > I think it's a refactoring rather than changes required by the
> > slotsync worker. We can do that in a separate patch but why do we need
> > this change in the first place?
>
> In v90, this refactoring was made due to the fact that
> validate_remote_info() was supposed to behave differently for SQL
> function and slot-sync worker. SQL-function was supposed to ERROR out
> while the worker was supposed to LOG and become no-op. And thus the
> change was needed. In v91, we made this functionality same i.e. both
> sql function and worker will error out but missed to remove this
> refactoring. Thanks for catching this, I will revert it in the next
> version. To match the refactoring, I made the comment change too, will
> revert that as well.
>
> > ---
> > + ValidateSlotSyncParams(ERROR);
> > +
> > /* Load the libpq-specific functions */
> > load_file("libpqwalreceiver", false);
> >
> > - ValidateSlotSyncParams();
> > + (void) CheckDbnameInConninfo();
> >
> > Is there any reason why we move where to check the parameters?
>
> Earlier DBname verification was done inside ValidateSlotSyncParams()
> and thus it was needed to load 'libpqwalreceiver' before we call this
> function. Now we have moved dbname verification in a separate call and
> thus the above order got changed. ValidateSlotSyncParams() is a common
> function used by SQL function and worker. Earlier slot sync worker was
> checking all the GUCs after starting up and was exiting each time any
> GUC was invalid. It was suggested that it would be better to check the
> GUCs before starting the slot sync worker in the postmaster itself,
> making the ValidateSlotSyncParams() move to postmaster (see
> SlotSyncWorkerAllowed). But it was not a good idea to load libpq in
> postmaster and thus we moved libpq related verification out of
> ValidateSlotSyncParams(). This resulted in the above change. I hope
> it answers your query.

Thank you for the explanation. It makes sense to me to move the check.

As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I
have two comments:

1. The error messages are not very descriptive and seem not to match
other messages the postmaster says. When starting the standby server
with misconfiguration about the slotsync, I got the following messages
from the postmaster:

2024-02-20 17:01:16.356 JST [456741] LOG: database system is ready to
accept read-only connections
2024-02-20 17:01:16.358 JST [456741] LOG: bad configuration for slot
synchronization
2024-02-20 17:01:16.358 JST [456741] HINT: "hot_standby_feedback"
must be enabled.

It says "bad configuration" but is still working, and does not say
further information such as whether it skipped to start the slotsync
worker etc. I think these messages could work for the slotsync worker
but we might want to have more descriptive messages for the
postmaster. For example, "skipped starting slot sync worker because
hot_standby_feedback is disabled".

2. If the wal_level is not logical, the server will need to restart
anyway to change the wal_level and have the slotsync worker work. Does
it make sense to have the postmaster exit if the wal_level is not
logical and sync_replication_slots is enabled? For instance, we have
similar checks in PostmsaterMain():

if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL)
ereport(ERROR,
(errmsg("WAL cannot be summarized when wal_level is
\"minimal\"")));

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2024-02-20 13:10:44 Re: Synchronizing slots from primary to standby
Previous Message Peter Eisentraut 2024-02-20 12:40:27 Re: Replace current implementations in crypt() and gen_salt() to OpenSSL