| From: | Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> |
|---|---|
| To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
| Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
| Subject: | Re: How can end users know the cause of LR slot sync delays? |
| Date: | 2025-11-12 10:57:56 |
| Message-ID: | CANhcyEWnDHueqi07tLMWaPZyChCJ6G2aReucDkZm4j5=hD-2Ow@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Fri, 31 Oct 2025 at 11:30, Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Shlok,
>
> Thanks for updating the patch. Few comments:
>
> ```
> The reason for the last slot synchronization skip. This field is set only
> for logical slots that are being synced from a primary server (that is,
> those whose <structfield>synced</structfield> field is
> <literal>true</literal>).
> ```
>
> What happens if the slot has a skip reason and the standby is promoted?
> Will the attribute be retained? If so, do we have to add some notes like "sync"?
>
I tested it and I agree that slot sync skip stats will be retained. I
have updated the docs accordingly
> ```
> +/* Update slot sync skip stats */
> +static void
> +update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason skip_reason,
> + bool acquire_slot)
> ```
>
> Let's follow existing codes; ReplicationSlotSetInactiveSince(), third argument
> can be `acquire_lock`.
>
I have updated it according to latest comment by Shveta in [1]
> ```
> + /*
> + * Update the slot sync related stats in pg_stat_replication_slot when a
> + * slot sync is skipped
> + */
> + if (skip_reason != SS_SKIP_NONE)
> + pgstat_report_replslot_sync_skip(slot);
> ```
>
> Is it OK to call pgstat_report_replslot_sync_skip() without any locks?
>
I think we need to acquire the slot before the call to
pgstat_report_replslot_sync_skip. This ensures
'pgstat_acquire_replslot' is called and the stats entry for the slot
is already present when we are updating the slot. Also in a similar
function 'pgstat_report_replslot', the comments says we should ensure
that pgstat_create_replslot() or pgstat_acquire_replslot() is called
before updating the stats.
> ```
> ReplicationSlotAcquire(NameStr(slot->data.name), true, true);
> ```
>
> Can you clarify the reason error_if_invalid=true? Other codes in the file use
> error_if_invalid=false.
>
I agree we should keep error_if_invalid=false because if
error_if_invalid=true and we try to acquire an invalidated slot an
error is thrown and it will exit the slotsync worker.
> ```
> + /* Update the slot sync skip reason */
> + SpinLockAcquire(&slot->mutex);
> + if (slot->slot_sync_skip_reason != skip_reason)
> + slot->slot_sync_skip_reason = skip_reason;
> + SpinLockRelease(&slot->mutex);
> ```
>
> Now the replication slot can be always acquired. Do we still have to acquire the
> spinlock even for reading the value? In other words, can we move SpinLockAcquire()
> and SpinLockRelease() into inside the if block?
>
I agree that we use SpinLockAcquire() and SpinLockRelease() inside the
if block. I have fixed it.
> ```
> # Copyright (c) 2024-2025, PostgreSQL Global Development Group
> ```
>
> I think 2024 can be removed.
>
Fixed
> ```
> my $primary = PostgreSQL::Test::Cluster->new('publisher');
> ```
>
> s/publisher/primary/.
>
Fixed
> ```
> # Pause steaming replication connection so that standby can lag behind
> unlink($primary->data_dir . '/pg_hba.conf');
> $primary->append_conf(
> 'pg_hba.conf', qq{
> local all all trust
> host all all 127.0.0.1/32 trust
> host all all ::1/128 trust
> });
> $primary->restart;
> ```
>
> Not sure it can be called like "Pause". how about like:
> ```
> Update pg_hba.conf and restart primar to reject streaming replication connections.
> WAL records won't be replicated to the standby until .conf is restored.
> ```
Fixed
> ```
> # Attempt to sync replication slots while standby is behind
> ($result, $stdout, $stderr) =
> $standby->psql('postgres', "SELECT pg_sync_replication_slots();");
> ```
>
> Can you verify the $stderr that synchornization was failed? I cannot find other
> tests which checks the message. It is enough to do once.
Added
> ```
> $result = $standby->safe_psql(
> 'postgres',
> "SELECT slot_sync_skip_reason FROM pg_replication_slots
> WHERE slot_name = 'slot_sync' AND synced AND NOT temporary"
> );
> is($result, 'missing_wal_record', "slot sync skip when standby is behind");
> ```
>
> I found the test does twice; can we remove second one?
>
Fixed
> ```
> # Cleanup: drop the logical slot and ensure standby catches up
> $primary->safe_psql('postgres',
> "SELECT pg_drop_replication_slot('slot_sync')");
> $primary->wait_for_replay_catchup($standby);
>
> $standby->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
>
> # Test for case when slot sync is skipped when the remote slot is
> # behind the local slot.
> $primary->safe_psql('postgres',
> "SELECT pg_create_logical_replication_slot('slot_sync', 'test_decoding', false, false, true)"
> );
> ```
>
> Can we use reset function instead of dropping it?
>
Here the test checks 'remote_behind' case for first slot sync cycle
(when slot is in temporary slot). To achieve that we need to recreate
the slot.
I have addressed the comment and attached the updated patch in [1].
[1]: https://www.postgresql.org/message-id/CANhcyEXEyUt8TycOqkgdWT%2BNeJASM%3D1acU1dK64UNsJeKMQFQA%40mail.gmail.com
Thanks,
Shlok Kyal
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Heikki Linnakangas | 2025-11-12 11:14:00 | Re: mxid and mxoff wraparound issues in pg_upgrade |
| Previous Message | Dagfinn Ilmari Mannsåker | 2025-11-12 10:55:47 | Re: [Patch] Windows relation extension failure at 2GB and 4GB |