Re: Synchronizing slots from primary to standby

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: shveta malik <shveta(dot)malik(at)gmail(dot)com>
Cc: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Ajin Cherian <itsajin(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Synchronizing slots from primary to standby
Date: 2023-12-10 11:03:35
Message-ID: CAA4eK1KbhdjKqui=fr4Ny2TwGAFU9WLWTdypN+WG0WEfnBR=4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
>
> v43-002:
>

Review comments on v43-0002:
=========================
1.
synchronize_one_slot()
{
...
+ /*
+ * With hot_standby_feedback enabled and invalidations handled
+ * apropriately as above, this should never happen.
+ */
+ if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn)
+ {
+ ereport(ERROR,
+ errmsg("not synchronizing local slot \"%s\" LSN(%X/%X)"
+ " to remote slot's LSN(%X/%X) as synchronization "
+ " would move it backwards", remote_slot->name,
+ LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn),
+ LSN_FORMAT_ARGS(remote_slot->restart_lsn)));
+
+ goto cleanup;
...
}

After the error, the control won't return, so the above goto doesn't
make any sense.

2.
synchronize_one_slot()
{
...
+ /* Search for the named slot */
+ if ((s = SearchNamedReplicationSlot(remote_slot->name, true)))
+ {
+ SpinLockAcquire(&s->mutex);
+ sync_state = s->data.sync_state;
+ SpinLockRelease(&s->mutex);
+ }
...
...
+ ReplicationSlotAcquire(remote_slot->name, true);
+
+ /*
+ * Copy the invalidation cause from remote only if local slot is not
+ * invalidated locally, we don't want to overwrite existing one.
+ */
+ if (MyReplicationSlot->data.invalidated == RS_INVAL_NONE)
+ {
+ SpinLockAcquire(&MyReplicationSlot->mutex);
+ MyReplicationSlot->data.invalidated = remote_slot->invalidated;
+ SpinLockRelease(&MyReplicationSlot->mutex);
+ }
+
+ /* Skip the sync if slot has been invalidated locally. */
+ if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
+ goto cleanup;
...

It seems useless to acquire the slot if it is locally invalidated in
the first place. Won't it be better if after the search we first check
whether the slot is locally invalidated and take appropriate action?

3. After doing the above two, I think it doesn't make sense to have
goto at the remaining places in synchronize_one_slot(). We can simply
release the slot and commit the transaction at other places.

4.
+ * Returns nap time for the next sync-cycle.
+ */
+static long
+synchronize_slots(WalReceiverConn *wrconn)

Returning nap time from here appears a bit awkward. I think it is
better if this function returns any_slot_updated and then the caller
decides the adjustment of naptime.

5.
+synchronize_slots(WalReceiverConn *wrconn)
{
...
...
+ /* The syscache access needs a transaction env. */
+ StartTransactionCommand();
+
+ /*
+ * Make result tuples live outside TopTransactionContext to make them
+ * accessible even after transaction is committed.
+ */
+ MemoryContextSwitchTo(oldctx);
+
+ /* Construct query to get slots info from the primary server */
+ initStringInfo(&s);
+ construct_slot_query(&s);
+
+ elog(DEBUG2, "slot sync worker's query:%s \n", s.data);
+
+ /* Execute the query */
+ res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow);
+ pfree(s.data);
+
+ if (res->status != WALRCV_OK_TUPLES)
+ ereport(ERROR,
+ (errmsg("could not fetch failover logical slots info "
+ "from the primary server: %s", res->err)));
+
+ CommitTransactionCommand();
...
...
}

Where exactly in the above code, there is a syscache access as
mentioned above StartTransactionCommand()?

6.
- <filename>~/.pgpass</filename> file on the standby server (use
+ <filename>~/.pgpass</filename> file on the standby server. (use
<literal>replication</literal> as the database name).

Why do we need this change?

7.
+ standby. Additionally, similar to creating a logical replication slot
+ on the hot standby, <varname>hot_standby_feedback</varname> should be
+ set on the standby and a physical slot between the primary and the standby
+ should be used.

In this, I don't understand the relation between the first part of the
line: "Additionally, similar to creating a logical replication slot on
the hot standby ..." with the rest.

8.
However,
+ the slots which were in initiated sync_state ('i) and were not

A single quote after 'i' is missing.

9.
the slots with state 'r' and 'i' can neither be used for logical
+ decoded nor dropped by the user.

/decoded/decoding

10.
+/*
+ * Allocate and initialize slow sync worker shared memory
+ */

/slow/slot

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2023-12-10 14:57:04 Re: Change GUC hashtable to use simplehash?
Previous Message jian he 2023-12-10 10:32:12 Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)