Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Hou, Zhijie/侯 志杰 <houzj(dot)fnst(at)fujitsu(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, shveta malik <shvetamalik(at)gmail(dot)com>
Subject: Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication
Date: 2026-05-14 11:06:18
Message-ID: CAJpy0uCKGCkfCXCd=tsDH5e85x155LsdbZW46WpWfsZJUe82bw@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, May 13, 2026 at 5:25 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi,
>
> On Wed, Apr 15, 2026 at 12:17 PM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> >
> > On Wed, Apr 15, 2026 at 12:00 PM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > >
> > >
> > > On Wed, Apr 15, 2026 at 11:09 AM shveta malik <shveta(dot)malik(at)gmail(dot)com> wrote:
> > > >
> > > > With this, 'FIRST 1(standby_1, standby_2)' is convered to wait_for_all
> > > > mode, which is not correct and it keeps wiating for standby_2 when
> > > > standby_1 has already taken changes. I am not sure what is correct way
> > > > to deal with it when it comes to first patch alone. My expectation was
> > > > that FIRST-syntax is blocked i.e. it errors out instead of partially
> > > > implemented and misbehaving. But if we plan to do so, the challenge
> > > > will be how to distinguish actual FIRST and comma separated list
> > > > implicitly converted to 'FIRST 1' by syncrep parser. For that we will
> > > > need either 003 or IsPrioritySyncStandbySlotsSyntax', thus defeating
> > > > the whole purpose of separating the patches. What do you think on
> > > > this? 001 is okay as is or we shall block FIRST?
> > > >
> > >
> > > AFAICS, we should first finalize the synchronous replication parser
> > > changes to give an identity to the plain list mode. Once that is
> > > settled, we may need to reorder the patch based on the decision we
> > > take. If we decide to proceed with the parser changes, my
> > > understanding is that they should come first, followed by the
> > > implementation of support for the ANY/FIRST clauses.
> >
> > I agree. I do not see any other correct way to do it.
> >
>
> I've reordered the patches as mentioned here to ensure each patch is
> independently functional. The updated sequence is:
>
> 1) 0001-Refactor-syncrep-parsing-to-represent-bare-standby-l.patch :
> Introduces a new synchronous replication method SYNC_REP_DEFAULT to
> represent the bare list form parsed from standby_list. This lets
> callers clearly distinguish between three forms:
>
> - Explicit priority syntax — FIRST N (...) or N (...)
> - Quorum syntax — ANY N (...)
> - Simple list syntax — no FIRST or ANY keyword
>
> 2) 0002-Add-ANY-N-semantics-to-synchronized_standby_slots.patch :
> Extends synchronized_standby_slots with ANY N quorum semantics.
>
> 3) 0003-Add-FIRST-N-and-N-.-priority-syntax-to-synchronized_.patch :
> Adds support for FIRST N and N (...) priority syntax to
> synchronized_standby_slots
>
> >
> > A few trivial comments on 001:
> >
> > 1)
> > +# B) ANY 1 (sb1_slot, sb2_slot) (quorum mode)
> > +# - Proceeds when at least N slots have caught up
> > +# - Skips missing/invalid/logical slots and lagging slots
> > (inactive or active)
> > +# to find N caught-up slots
> >
> > We can mention 'ANY N' instead of 'ANY 1' so that the explanation
> > using 'N' makes more sense.
> >
> > 2)
> > +sub poll_query_until_quiet
> > +{
> > + my ($node, $dbname, $query, $expected, $timeout_secs) = @_;
> > +
> > + $expected = 't' unless defined($expected);
> > + $timeout_secs = $PostgreSQL::Test::Utils::timeout_default
> >
> > We can get rid of this function now.
>
> Above two comments have been taken care of in the attached patches.
>
> Please take a look and let me know your views.
>

Okay, it seems much better now. IMO, the changes of 001 are simple
enough to be incorporated but lets see what others have to say.

By changing the order of patches, 0002 (ANY) now works well, the
previously reported problem is gone. I had yet to verify 003
compeltely. Will finish the reveiw soon.

A few minor points, though:

patch 002:

1)
+ * The layout mirrors SyncRepConfigData so that the same quorum/priority

quorum/priority --> quorum and priority

2)
+ if (!inactive)
+ slot_states[num_slot_states].restart_lsn = restart_lsn;

I think there is no harm in making this assignmenet irrespective of
'inactive' flag. The code will look clean.

patch003:

3)
- wait_for_all = (required == synchronized_standby_slots_config->nslotnames);
+ wait_for_all =
+ (synchronized_standby_slots_config->syncrep_method == SYNC_REP_DEFAULT);

We could have done the same in patch002 itself, or not?

4)
- /* Stop processing if the required number of slots have caught up. */
+ /*
+ * Stop processing if the required number of slots have caught up. In
+ * priority mode (FIRST N), this ensures we select the first N slots in
+ * sequential order. In quorum mode (ANY N), we still process in order for
+ * efficiency, stopping once we find any N.
+ */

I am not able to understand this comment very well despite of knowing
the code well. Do you think we can improve it? Or was the first
sentence enough (the existing one)? That explained it well.

~~

Reviewing and verifying further.

thanks
Shveta

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nitin Motiani 2026-05-14 11:12:33 [PATCH] Fix for bug #19474: LIKE fails to match literal backslashes with nondeterministic collations
Previous Message Amit Kapila 2026-05-14 10:48:33 Re: Proposal: Conflict log history table for Logical Replication