Re: Synchronizing slots from primary to standby

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Re: Synchronizing slots from primary to standby
Date: 2024-02-20 02:55:08
Message-ID: CAD21AoCYXhDYOQDAS-rhGasC2T+tYbV=8Y18o94sB=5AxcW+yA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 19, 2024 at 9:59 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > Few comments on 0001
>
> Thanks for the feedback.
>
> > ====================
> > 1. I think it is better to error out when the valid GUC or option is
> > not set in ensure_valid_slotsync_params() and
> > ensure_valid_remote_info() instead of waiting. And we shouldn't start
> > the worker in the first place if not all required GUCs are set. This
> > will additionally simplify the code a bit.
>
> Sure, removed 'ensure' functions. Moved the validation checks to the
> postmaster before starting the slot sync worker.
>
> > 2.
> > +typedef struct SlotSyncWorkerCtxStruct
> > {
> > - /* prevents concurrent slot syncs to avoid slot overwrites */
> > + pid_t pid;
> > + bool stopSignaled;
> > bool syncing;
> > + time_t last_start_time;
> > slock_t mutex;
> > -} SlotSyncCtxStruct;
> > +} SlotSyncWorkerCtxStruct;
> >
> > I think we don't need to change the name of this struct as this can be
> > used both by the worker and the backend. We can probably add the
> > comment to indicate that all the fields except 'syncing' are used by
> > slotsync worker.
>
> Modified.
>
> > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't
> > be changed to SlotSyncWorkerShmemInit().
>
> Modified.
>
> > 4.
> > +ReplSlotSyncWorkerMain(int argc, char *argv[])
> > {
> > ...
> > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0);
> > ...
> > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn));
> > ...
> > }
> >
> > Do we need two separate callbacks? Can't we have just one (say
> > slotsync_failure_callback) that cleans additional things in case of
> > slotsync worker? And, if we need both those callbacks then please add
> > some comments for both and why one is before_shmem_exit() and the
> > other is on_shmem_exit().
>
> I think we can merge these now. Earlier 'on_shmem_exit' was needed to
> avoid race-condition between startup and slot sync worker process to
> drop 'i' slots on promotion. Now we do not have any such scenario.
> But I need some time to analyze it well. Will do it in the next
> version.
>
> > In addition to the above, I have made a few changes in the comments
> > and code (cosmetic code changes). Please include those in the next
> > version if you find them okay. You need to rename .txt file to remove
> > .txt and apply atop v90-0001*.
>
> Sure, included these.
>
> Please find the patch001 attached.

I've reviewed the v91 patch. Here are random comments:

---
/*
* Checks the remote server info.
*
- * We ensure that the 'primary_slot_name' exists on the remote server and the
- * remote server is not a standby node.
+ * Check whether we are a cascading standby. For non-cascading standbys, it
+ * also ensures that the 'primary_slot_name' exists on the remote server.
*/

IIUC what the validate_remote_info() does doesn't not change by this
patch, so the previous comment seems to be clearer to me.

---
if (remote_in_recovery)
+ {
+ /*
+ * If we are a cascading standby, no need to check further for
+ * 'primary_slot_name'.
+ */
ereport(ERROR,
errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot synchronize replication slots from a
standby server"));
+ }
+ else
+ {
+ bool primary_slot_valid;

- primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
- Assert(!isnull);
+ primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull));
+ Assert(!isnull);

- if (!primary_slot_valid)
- ereport(ERROR,
- errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("bad configuration for slot synchronization"),
- /* translator: second %s is a GUC variable name */
- errdetail("The replication slot \"%s\" specified by
\"%s\" does not exist on the primary server.",
- PrimarySlotName, "primary_slot_name"));
+ if (!primary_slot_valid)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ /* translator: second %s is a GUC variable name */
+ errdetail("The replication slot \"%s\" specified
by \"%s\" does not exist on the primary server.",
+ PrimarySlotName, "primary_slot_name"));
+ }

I think it's a refactoring rather than changes required by the
slotsync worker. We can do that in a separate patch but why do we need
this change in the first place?

---
+ ValidateSlotSyncParams(ERROR);
+
/* Load the libpq-specific functions */
load_file("libpqwalreceiver", false);

- ValidateSlotSyncParams();
+ (void) CheckDbnameInConninfo();

Is there any reason why we move where to check the parameters?

Some comments not related to the patch but to the existing code:

---
It might have already been discussed but is the
src/backend/replication/logical the right place for the slocsync.c? If
it's independent of logical decoding/replication, is under
src/backend/replication could be more appropriate?

---
/* Construct query to fetch slots with failover enabled. */
appendStringInfo(&s,
"SELECT slot_name, plugin, confirmed_flush_lsn,"
" restart_lsn, catalog_xmin, two_phase, failover,"
" database, conflict_reason"
" FROM pg_catalog.pg_replication_slots"
" WHERE failover and NOT temporary");

/* Execute the query */
res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow);
pfree(s.data);

We don't need 's' as the query is constant.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Wienhold 2024-02-20 02:58:05 Re: Patch: Add parse_type Function
Previous Message Andrei Lepikhov 2024-02-20 02:49:38 Re: Memory consumed by paths during partitionwise join planning