Re: How can end users know the cause of LR slot sync delays?

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Subject: Re: How can end users know the cause of LR slot sync delays?
Date: 2025-10-30 12:16:02
Message-ID: CANhcyEV7B1VnxUQbBUqwvWZvNmR-ao6jti_0M_Tn+o9vnwh58Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, 20 Oct 2025 at 14:27, Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Shlok,
>
> > > 01.
> > > ```
> > > + /* Update the slot sync reason */
> > > + SpinLockAcquire(&slot->mutex);
> > > + if (slot->slot_sync_skip_reason != skip_reason)
> > > + slot->slot_sync_skip_reason = skip_reason;
> > > + SpinLockRelease(&slot->mutex);
> > > ```
> > >
> > > Per my understanding, the spinlock is acquired when an attribute in shared
> > > memory is being updated. Can you check other parts and follow the rule?
> > >
> > > 02.
> > > ```
> > > + SpinLockAcquire(&slot->mutex);
> > > + synced = slot->data.synced;
> > > + SpinLockRelease(&slot->mutex);
> > > ```
> > >
> > > Same as 1.
> > >
> > I checked and found the following comment:
> > * - Individual fields are protected by mutex where only the backend owning
> > * the slot is authorized to update the fields from its own slot. The
> > * backend owning the slot does not need to take this lock when reading its
> > * own fields, while concurrent backends not owning this slot should take the
> > * lock when reading this slot's data.
> >
I realised that the patch does not entirely follow the above rule. As
per my understanding, the rule says:
1. If we want to update a field of a slot, we need to own the slot (or,
we can say, acquire the slot).
2. In that case we also need to take a spinlock on the slot to update
any field.
3. If we want to read a field and we own the slot, we do not need a
spinlock on the slot.
4. If we want to read a field and we do not own the slot, we need a
spinlock on the slot.

For the current patch (v5), in the function "synchronize_one_slot" we
are calling "update_slot_sync_skip_stats" even though we do not own the
slot.
So, as per the rule, I have updated "update_slot_sync_skip_stats" to
own the slot before updating.
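
To illustrate rules 1 and 2, a minimal sketch of how the update path can
look (a sketch only, not the exact v6 code; the enum type name
SlotSyncSkipReason is an assumption, since only the value SS_SKIP_NONE
appears in the quoted hunks):
```
/*
 * Sketch only.  The type name SlotSyncSkipReason is assumed; the field
 * slot_sync_skip_reason and SS_SKIP_NONE are from the patch.
 */
static void
update_slot_sync_skip_stats(ReplicationSlot *slot, SlotSyncSkipReason skip_reason)
{
	/* Rule 1: only the backend owning the slot may update its fields. */
	Assert(MyReplicationSlot == slot);

	/* Rule 2: even the owner takes the spinlock when writing shared fields. */
	SpinLockAcquire(&slot->mutex);
	slot->slot_sync_skip_reason = skip_reason;
	SpinLockRelease(&slot->mutex);
}
```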

> > So for the above two cases we are updating the
> > 'slot->slot_sync_skip_reason' and reading 'slot->data.synced' and this
> > can happen before the slot sync worker acquires the slot or owns the
> > slot.
> > Also in the same code at a later stage we are again checking the
> > synced flag and we do that while holding a spin lock. Based on these
> > observations I think we should take Spinlock in both cases.
>
> Hmm, regarding update_slot_sync_skip_stats(), the replication slot has already been
> acquired except in the synchronize_one_slot() case.
> Can we avoid acquiring the spinlock as much as possible by adding an argument?
> Or does that just introduce additional complexity?
>
After updating the code as per the rule, I think we always have to take a
spinlock on the slot when we are updating any field.

> > > 09.
> > > ```
> > > +my $connstr_1 = $primary->connstr;
> > > ```
> > >
> > > Since this is the only connection string in the test, the suffix _1 is not needed.
> > >
> > Fixed
>
> Similar to the comment above, can you replace "standby1" with "standby"?
>
Fixed

> >
> > > 10.
> > > ```
> > > +# Simulate standby connection failure by modifying pg_hba.conf
> > > +unlink($primary->data_dir . '/pg_hba.conf');
> > > +$primary->append_conf('pg_hba.conf',
> > > + qq{local all all trust}
> > > +);
> > > ```
> > >
> > > What if the system does not have a Unix domain socket? I'm afraid all connections
> > > could be blocked in this case.
> > >
> > I have used an injection point to simulate this scenario instead of
> > changing the contents of the pg_hba.conf file.
>
> Can you clarify the reason why you used the injection point?
> I'm not sure the injection point is beneficial here. I feel such a point can be added
> when we handle timing issues, race conditions, etc., but the walreceiver may not have
> a strong reason to stop exactly at that point.
>
> Regarding the content of pg_hba.conf, I felt the lines below might be enough:
> ```
> local all all trust
> host all all 127.0.0.1/32 trust
> ```
>
I checked this.
By default, pg_hba.conf has the following contents:
```
# "local" is for Unix domain socket connections only
local all all trust
# IPv4 local connections:
host all all 127.0.0.1/32 trust
# IPv6 local connections:
host all all ::1/128 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
```
Now, to prevent streaming replication in our test, we can set pg_hba.conf to:
```
local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
```
And then, to restore streaming replication, we can add the following to pg_hba.conf:
```
local replication all trust
host replication all 127.0.0.1/32 trust
host replication all ::1/128 trust
```
I think this would be sufficient for our testing. Thoughts?
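
In the test this could look roughly like the following (an untested
sketch using the existing PostgreSQL::Test::Cluster helpers; the node
variable name is just illustrative):
```
# Sketch only: block replication connections by rewriting pg_hba.conf.
# A reload is needed for the change to take effect.
unlink($primary->data_dir . '/pg_hba.conf');
$primary->append_conf('pg_hba.conf', qq{
local   all         all                 trust
host    all         all  127.0.0.1/32   trust
host    all         all  ::1/128        trust
});
$primary->reload;

# ... check that the slot sync skip reason is reported here ...

# Restore replication connections and reload again.
$primary->append_conf('pg_hba.conf', qq{
local   replication all                 trust
host    replication all  127.0.0.1/32   trust
host    replication all  ::1/128        trust
});
$primary->reload;
```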

> Also, here are comments for v5.
>
> ```
> + <para>
> + Reason of the last slot synchronization skip.
> + </para></entry>
> ```
>
> Possible values must be clarified. This was posted in [1] but seems to have been missed.
Sorry, I missed it. I have updated it in the latest patch.

>
> ```
> + /* Update the slot sync reason */
> ```
>
> It is better to clarify that it is updating the *skip* reason.
Fixed

>
> ```
> - ReplicationSlot *slot;
> + ReplicationSlot *slot = NULL;
> ```
>
> No need to initialize as NULL.
>
Fixed

> ```
> +#include "utils/injection_point.h"
> ...
> + INJECTION_POINT("walreceiver", NULL);
> ```
>
> As I said above, I have a concern about adding the injection point. I want to hear
> others' opinions as well.
>
Removed it for now; as per my analysis, we can modify pg_hba.conf to
simulate the scenario.

> ```
> + else
> + {
> + /* Update the slot sync stats */
> + Assert(!found_consistent_snapshot ||
> + *found_consistent_snapshot);
> + update_slot_sync_skip_stats(slot, SS_SKIP_NONE);
> + }
> ```
>
> Your patch may have another issue; if both confirmed_flush_lsn values are the same
> but we do not have a consistent snapshot yet, we would get an assertion failure.
> (Again, I am not sure it can really happen.)
> Can we use that condition as a separate if branch? In that case we must clarify why
> it is OK to pass when found_consistent_snapshot == NULL.
>
Fixed
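
A rough sketch of how the branch could be split (not necessarily exactly
what v6 does; SS_SKIP_NO_CONSISTENT_SNAPSHOT is only a placeholder name):
```
else if (found_consistent_snapshot && !*found_consistent_snapshot)
{
	/*
	 * Same confirmed_flush_lsn, but no consistent snapshot yet, so do not
	 * treat this round as a successful sync.  (Placeholder reason name.)
	 */
	update_slot_sync_skip_stats(slot, SS_SKIP_NO_CONSISTENT_SNAPSHOT);
}
else
{
	/*
	 * Either the caller did not ask about the snapshot
	 * (found_consistent_snapshot == NULL) or it is already consistent,
	 * so nothing was skipped.
	 */
	update_slot_sync_skip_stats(slot, SS_SKIP_NONE);
}
```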

> ```
> +# Attach injection point to simulate wait
> +$standby_psql->query_safe(
> + q(select injection_points_attach('slot-sync-skip','wait')));
> ```
>
> I have been considering whether we can remove the injection point here or not.
> I think the point is used because the slot being synchronized is still a temporary
> one; it would be cleaned up by ReplicationSlotCleanup().
> Can we reproduce the skip event for a permanent slot? I cannot come up with a way,
> but if possible there is no need to introduce the injection point.
I tried reproducing it but was not able to come up with a test without an
injection point. I will keep trying to reproduce it without the injection
point.

>
> [1]: https://www.postgresql.org/message-id/OSCPR01MB14966A618A8C61EC3DEE486A4F517A%40OSCPR01MB14966.jpnprd01.prod.outlook.com
>

I have attached the latest patch.

Thanks,
Shlok Kyal

Attachment Content-Type Size
v6-0002-Add-test-for-new-stats-for-slot-sync-skip.patch application/octet-stream 7.8 KB
v6-0001-Add-stats-related-to-slot-sync-skip.patch application/octet-stream 24.3 KB
