Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

From: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andrey Borodin <amborodin86(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date: 2026-03-18 07:28:44
Message-ID: CAHg+QDdd7BXB9HD9ddevk_D5TtweEBantcvJ5up5hznryZ33_w@mail.gmail.com
Lists: pgsql-hackers

Reviving this thread.

On Sun, Jan 29, 2023 at 9:55 PM Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:

> For proc die, it looks like the suggestion was to process it
> immediately and upon next restart, don't allow user connections unless
> all sync standbys were caught up. However, we need to be able to allow
> replication connections from standbys so that they'll be able to
> stream the needed WAL and catch up with primary, allow superuser or
> users with pg_monitor role to connect to perform ALTER SYSTEM to
> remove the unresponsive sync standbys if any from the list or disable
> sync replication altogether or monitor for flush lsn/catch up status.
> And block all other connections. Note that replication, superuser and
> users with pg_monitor role connections are allowed only after the
> server reaches a consistent state not before that to not read any
> inconsistent data.
>

Allowing replication, superuser and pg_monitor seems reasonable to me.
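As a rough illustration of the gating rule quoted above, here is a
hypothetical Python sketch. It is not PostgreSQL code; the function and
parameter names are invented for this example, and it assumes that ordinary
users are admitted once the sync standbys have caught up.

```python
def connection_allowed(reached_consistency: bool,
                       sync_standbys_caught_up: bool,
                       is_replication: bool = False,
                       is_superuser: bool = False,
                       has_pg_monitor: bool = False) -> bool:
    """Decide whether a new connection is admitted after restart.

    Hypothetical model of the proposal, not an existing PostgreSQL API.
    """
    # Nobody connects before the server reaches a consistent state,
    # so that no inconsistent data can be read.
    if not reached_consistency:
        return False
    # Once the sync standbys have caught up, everyone is admitted.
    if sync_standbys_caught_up:
        return True
    # Until then, admit only the connections needed to catch up,
    # monitor progress, or reconfigure sync replication.
    return is_replication or is_superuser or has_pg_monitor
```

With this model, a walsender connection is admitted while the standbys are
still behind, and an ordinary user connection is rejected until they catch up.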

>
> The trickiest part of doing the above is how we detect upon restart
> that the server received proc die while waiting for sync replication
> ACK. One idea might be to set a flag in the control file before the
> crash. Second idea might be to write a marker file (although I don't
> favor this idea); presence indicates that the server was waiting for
> sync replication ACK before the crash. However, we may not detect all
> sorts of crashes in a backend when it is waiting for sync replication
> ACK to do any of these two ideas. Therefore, this may not be a
> complete solution.
>

You cannot control the crash; it can be a simple power failure too, in which
case neither the control file flag nor the marker file may have reached the
disk. Additionally, this is in the critical transaction commit path.

>
> Third idea might be to just let the primary wait for sync standbys to
> catch up upon restart irrespective of whether it was crashed or not
> while waiting for sync replication ACK. While this idea works well
> without having to detect all sorts of crashes, the primary may not
> come up if any unresponsive standbys are present (currently, the
> primary continues to be operational for read-only queries at least
> irrespective of whether sync standbys have caught up or not).
>

I prefer this approach because, depending on the quorum policy defined in
synchronous_standby_names, the primary will open connections for
reads/writes.
If there is no progress from the sync standbys, then the Postgres admin has
to jump in regardless.
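
For illustration only, the check the primary would make after restart under
this approach could look roughly like the Python sketch below. It is a
simplified model, not PostgreSQL code: SyncPolicy, can_open_for_users, and
the use of plain integers for LSNs are assumptions made for the example.
flush_lsn holds the last reported flush position of each currently connected
standby, and local_end_lsn is the end of the primary's local WAL at restart.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SyncPolicy:
    method: str          # "ANY" (quorum) or "FIRST" (priority)
    num_sync: int        # how many standbys must confirm
    standbys: List[str]  # standby names in the order they are listed


def can_open_for_users(policy: SyncPolicy,
                       flush_lsn: Dict[str, int],
                       local_end_lsn: int) -> bool:
    """Return True once enough sync standbys have flushed local_end_lsn."""
    # Only standbys that are listed and currently connected can count.
    connected = [s for s in policy.standbys if s in flush_lsn]

    if policy.method == "ANY":
        # Quorum: any num_sync of the listed standbys may confirm.
        confirmed = sum(1 for s in connected
                        if flush_lsn[s] >= local_end_lsn)
        return confirmed >= policy.num_sync

    if policy.method == "FIRST":
        # Priority: the num_sync highest-priority connected standbys
        # must all confirm.
        chosen = connected[:policy.num_sync]
        return (len(chosen) == policy.num_sync
                and all(flush_lsn[s] >= local_end_lsn for s in chosen))

    raise ValueError(f"unknown method: {policy.method}")
```

For example, with ANY 2 (s1, s2, s3) the primary would open as soon as any
two of the three standbys report a flush position at or beyond
local_end_lsn, even if the third stays unresponsive.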

Thanks,
Satya
