| From: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
|---|---|
| To: | surya poondla <suryapoondla4(at)gmail(dot)com> |
| Cc: | SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Date: | 2026-03-23 05:32:04 |
| Message-ID: | CAE9k0PkJWX3k25hWTXPboxUtUEkGGWMvjAuE1fwkPkRvLE24AA@mail.gmail.com |
| Lists: | pgsql-hackers |
On Mon, Mar 23, 2026 at 9:51 AM surya poondla <suryapoondla4(at)gmail(dot)com> wrote:
>
> Hi All,
>
> Thank you for reporting a real gap and building this feature to address it. Very nice points were discussed in this thread.
>
> I reviewed the v20260318 patch and have some comments.
>
> Documentation comments:
> 1. FIRST mode does not specify what happens when valid slots < N
> "If a slot is missing, logical, invalidated, or inactive, it will be skipped. However, if a slot exists and is valid and active but has not yet caught up, the system will wait for it rather than skipping to lower-priority slots."
> This paragraph explains the skip/wait distinction clearly, but it doesn't address what happens when, after skipping all missing, invalidated, inactive, or logical slots, the number of remaining valid slots is less than num_sync.
>
> For example, with FIRST 2 (sb1_slot, sb2_slot, sb3_slot): if sb1_slot and sb2_slot are both invalidated and only sb3_slot is valid but lagging, FIRST 2 requires two slots, but only one candidate remains.
>
> Looking at the code in StandbySlotsHaveCaughtup(), when syncrep_method == SYNC_REP_PRIORITY and a slot is lagging, the code does:
>     if (wait_for_all || synchronized_standby_slots_config->syncrep_method == SYNC_REP_PRIORITY)
>         break;
>
> So the function breaks out of the loop and returns false. This is the correct behavior, but it is not stated anywhere in the documentation. A user encountering this scenario will not know whether to expect a wait or an error. The documentation should state explicitly that in FIRST mode, if fewer valid slots than num_sync are available, logical decoding waits indefinitely.
>
> 2. "Missing, logical, invalidated, or inactive slots are skipped when determining candidates, and lagging slots simply do not count toward the required number until they catch up"
> This is correct for the case where some slots are skipped and others have caught up. But it does not address the case where all listed slots are lagging: every slot is healthy and connected, but none has reached wait_for_lsn yet. In that situation, the code records each slot as SS_SLOT_LAGGING, does goto next_slot for each (because syncrep_method == SYNC_REP_QUORUM), and returns false because caught_up_slot_num < required. Logical decoding waits.
>
> You can append the following sentence to the above documentation paragraph: "If fewer than num_sync slots have caught up at a given moment, logical decoding waits until that threshold is reached."
>
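The FIRST and ANY behaviours described in the two documentation comments above can be modelled with a small, self-contained C sketch. This is not the patch's code: the enum names, slots_have_caught_up(), and its parameters are all hypothetical, and it only mirrors the control flow described in this review (skip unusable slots, stop at a lagging slot in FIRST mode, move on to the next slot in ANY mode).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified slot states for illustration only. */
typedef enum
{
    SLOT_CAUGHT_UP,             /* valid, active, and has reached wait_for_lsn */
    SLOT_LAGGING,               /* valid and active, but behind wait_for_lsn */
    SLOT_SKIPPED                /* missing, logical, invalidated, or inactive */
} SlotState;

/* Hypothetical stand-ins for the priority (FIRST) and quorum (ANY) methods. */
typedef enum
{
    MODE_FIRST,
    MODE_ANY
} SyncMode;

/*
 * Return true if at least num_sync slots count as caught up under the given
 * mode.  A false result corresponds to "logical decoding keeps waiting".
 */
static bool
slots_have_caught_up(SyncMode mode, int num_sync,
                     const SlotState *slots, size_t nslots)
{
    int         caught_up = 0;

    for (size_t i = 0; i < nslots && caught_up < num_sync; i++)
    {
        if (slots[i] == SLOT_SKIPPED)
            continue;           /* unusable slots are never candidates */

        if (slots[i] == SLOT_LAGGING)
        {
            if (mode == MODE_FIRST)
                break;          /* FIRST: wait, do not fall through to lower priority */
            continue;           /* ANY: this slot just does not count yet */
        }

        caught_up++;
    }

    return caught_up >= num_sync;
}
```

In this model, FIRST 2 with two skipped slots and one lagging slot returns false, as does ANY 2 with three healthy but lagging slots; in both cases the caller would keep waiting rather than raise an error.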
>
> Test comments:
> 1. "PART D: Verify FIRST N priority semantics. # 3. Wait for valid but lagging slots (not skip to lower priority)"
> The test implements this by calling $standby1->stop. A stopped standby has no active_pid, so the slot is classified as SS_SLOT_INACTIVE, not SS_SLOT_LAGGING.
> SS_SLOT_LAGGING means it is connected and streaming but restart_lsn < wait_for_lsn.
>
> As Hou previously mentioned, recovery_min_apply_delay on standby1 would be one way to keep it connected while forcing its WAL application to lag, exercising the SS_SLOT_LAGGING code path directly. It is worth adding a test that covers this, both for FIRST (to confirm it blocks) and for ANY (to confirm it does not).
>
> Overall a great patch.
>
Thank you, Hou, Surya, and Dilip, for your feedback. I will address all
your comments in the next patch.
--
With Regards,
Ashutosh Sharma.