RE: How can end users know the cause of LR slot sync delays?

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: 'Shlok Kyal' <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Subject: RE: How can end users know the cause of LR slot sync delays?
Date: 2025-09-30 12:51:58
Message-ID: OSCPR01MB149667826BC79248A20999D1BF51AA@OSCPR01MB14966.jpnprd01.prod.outlook.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear Shlok,

Thanks for updating the patch. Here are my comments.

01.
```
+ /* Update the slot sync reason */
+ SpinLockAcquire(&slot->mutex);
+ if (slot->slot_sync_skip_reason != skip_reason)
+ slot->slot_sync_skip_reason = skip_reason;
+ SpinLockRelease(&slot->mutex);
```

Per my understanding, spinlock is acquired when the attribute on the shared memory
is updating. Can you check other parts and follow the rukle?

02.
```
+ SpinLockAcquire(&slot->mutex);
+ synced = slot->data.synced;
+ SpinLockRelease(&slot->mutex);
```

Same as 1.

03.
```
```

04.
```
+#ifdef USE_INJECTION_POINTS
+ INJECTION_POINT("slot-sync-skip", NULL);
+#endif
```

No need do do #ifdef here. INJECTION_POINT() itself checks internally.

05.
```
+# Initialize the primary cluster
+my $primary = PostgreSQL::Test::Cluster->new('publisher');
+$primary->init(
+ allows_streaming => 'logical',
+ auth_extra => [ '--create-role' => 'repl_role' ]);
```

Do we have to create repl_role? I'm not sure where it's used.

06.
```
+$primary->append_conf(
+ 'postgresql.conf', qq{
+autovacuum = off
+max_prepared_transactions = 1
+});
```

Do we have to set max_prepared_transactions? PREPARE command is not used.

07.
```
+# Load the injection_points extension
+$primary->safe_psql('postgres', q(CREATE EXTENSION injection_points))
```

We must check whether the injection_points is available or not. See 047_checkpoint_physical_slot.pl.

08.
```
+# Initialize standby from primary backup
+my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
+$standby1->init_from_backup(
+ $primary, $backup_name,
+ has_streaming => 1,
+ has_restoring => 1);
```

To clarify, is there a reason why we set has_restoring? Can we remove it?

09.
```
+my $connstr_1 = $primary->connstr;
```

Since this is an only connection string in the test, suffix _1 is not needed.

10.
```
+# Simulate standby connection failure by modifying pg_hba.conf
+unlink($primary->data_dir . '/pg_hba.conf');
+$primary->append_conf('pg_hba.conf',
+ qq{local all all trust}
+);
```

What if the system does not have Unix domain socket? I'm afraid all connections
could be brocked in this case.

11.
```
+# Attempt to sync replication slots while standby is behind
+$standby1->psql('postgres', "SELECT pg_sync_replication_slots();");
```

When the command can be failed, can you set a error message from the command to
a variable? Otherwise an ERROR is output to the result file, which is suprising.

```
psql:<stdin>:1: ERROR: skipping slot synchronization because the received slot sync LSN 0/03019E58 for slot "slot_sync" is ahead of the standby position 0/030000D8
[17:23:21.495](0.384s) ok 2 - slot sync skip when standby is behind
```

12.
```
+# Restore connectivity between primary and standby
+unlink($primary->data_dir . '/pg_hba.conf');
+$primary->append_conf(
+ 'pg_hba.conf',
+ qq{
+local all all trust
+local replication all trust
+});
```

Same as 10. Also, no need to do unlink() here.

13.
```
+# Create a new logical slot for testing injection point
+$primary->safe_psql('postgres',
+ "SELECT pg_create_logical_replication_slot('slot_sync', 'test_decoding', false, false, true)"
+);
```

Before here can you add a description what you would test?

14.
```
# Create a physical replication slot on primary for standby
$primary->psql('postgres',
q{SELECT pg_create_physical_replication_slot('sb1_slot');});
```
Use safe_psql instead of psql().

15.
```
- if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
+ if (slot || (slot = SearchNamedReplicationSlot(remote_slot->name, true)))
```

Is there a possibility that slot is not NULL? It is used when confirmed_flush exceeds
the latestFlushPtr, but in this case the function cannot reach here.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2025-09-30 12:59:49 Re: Remove custom redundant full page write description from GIN
Previous Message Tomas Vondra 2025-09-30 12:42:18 Re: Proposal: Adding compression of temporary files