| From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
|---|---|
| To: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
| Cc: | shveta malik <shveta(dot)malik(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Date: | 2026-03-20 11:28:04 |
| Message-ID: | CAFiTN-uf0DE2KphPJPhq3+_GbmzNfhTFjsSgpSqHerFdFhx6pw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Wed, Mar 18, 2026 at 4:08 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi,
>
> Thanks again for reviewing the test cases. Please find my comments inline below:
>
> On Wed, Mar 18, 2026 at 12:01 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> >
> > Few comments:
> >
> > 1)
> > +
> > +# Physical replication slots for the two standbys
> > +$primary->safe_psql('postgres',
> > + "SELECT pg_create_physical_replication_slot('sb1_slot');");
> > +$primary->safe_psql('postgres',
> > + "SELECT pg_create_physical_replication_slot('sb2_slot');");
> > +$primary->safe_psql('postgres',
> > + "SELECT pg_create_physical_replication_slot('first_slot');");
> >
> > Can we please write a comment atop 'first_slot' creation for its purpose.
> > The comment '+# Physical replication slots for the two standbys'
> > confuses as we are creating 3 slots following that comment.
> >
>
> Moved this to PART E (the test-case where this slot is being used).
> This is also suggested in one of your comments below.
>
> > 2)
> > +# PART A: Plain list (ALL mode) blocks when any slot is unavailable
> >
> > +# PART C: Verify plain list (ALL mode) requires ALL slots
> >
> > I think we can merge the 2 tests. Is there a specific reason we have
> > it in 2 different sections (am I missing something?).
> >
> > Part A can first test blocking on inactive standby (which it is doing
> > already) and then restart standby and can check that logical decoding
> > now proceeds. Part C can then be removed.
> >
>
> I did not remove PART C entirely, however, I removed the portions that
> overlapped with PART A.
>
> I would suggest not just evaluating each test case independently, it
> would be useful to also consider the overall flow to understand why
> PART B follows PART A, and so on.
>
> Viewed sequentially, PART A tests a unique scenario while also setting
> up the environment needed for PART B (quorum mode). Similarly, PART C
> covers a different scenario and prepares the setup for the test case
> in PART D. This approach helps avoid unnecessary stop/start cycles of
> the synchronous standby.
>
> > 3)
> > Same with these 2:
> > +# PART B: ANY mode (quorum) — logical decoding proceeds with N-of-M slots
> >
> > +# PART D: ANY mode — verify it works with either standby
> >
> > Part B tests that it works when standby1 is down.
> > Part D tests that it works when standby2 is down and also works when
> > both are up.
> >
> > I think we don't need these duplicate test cases (unless I am missing
> > something?), we can merge Part D to Part B itself i.e. test with
> > standby1 down and up in a single test.
> >
>
> Removed PART D. It was almost similar to PART B.
>
> > 4)
> > These 2 tests are basically testing the same thing. Do we need both?
> > +# FIRST 1 should skip sb1_slot (unavailable) and use sb2_slot.
> >
> > +# Test that FIRST 1 is different from plain list — FIRST 1 succeeds
> > with one down.
> >
>
> Yes, they are actually the same. Hence removed the latter one.
>
> >
> > 5)
> > I see that we only need 'first_slot' for the last test and do not have
> > even a standby associated with it. So we can even move it's creation
> > to that test itself instead of creating it in the beginning and adding
> > comments to explain 'why'.
> >
>
> This is related to comment 1 above.
>
> PFA patch that addresses above comments.
I was reviewing the latest patch and here are few comments/suggestions
1.
+ <para>
+ The keyword <literal>FIRST</literal>, coupled with
+ <replaceable class="parameter">num_sync</replaceable>, specifies
+ priority-based semantics. Logical decoding will wait for the first
+ <replaceable class="parameter">num_sync</replaceable> available
+ physical slots in priority order (the order they appear in the list).
+ If a slot is missing, logical, invalidated, or inactive, it will be
+ skipped. However, if a slot exists and is valid and active but has
+ not yet caught up, the system will wait for it rather than skipping
+ to lower-priority slots.
+ </para>
This explains that it will skip missing, logical, invalidated, or
inactive slots while processing the list and deciding on which first N
slots to wait for, but I could not understand what would be the
behavior if after skipping a few invalid slots the remaining valid
slots are less than N, will it proceed for decoding? I think this
should be clarified in docs.
2. I see a lot of overlapping code between
check_synchronized_standby_slots() and
check_synchronous_standby_names(), can we do some refactoring to make
a common code?
I have just pass through the patch and will try to complete the review soon.
--
Regards,
Dilip Kumar
Google
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Amul Sul | 2026-03-20 11:31:00 | Re: pg_waldump: support decoding of WAL inside tarfile |
| Previous Message | Jim Jones | 2026-03-20 11:08:59 | Re: Truncate logs by max_log_size |