Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Subject: Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date: 2022-10-05 21:00:03
Message-ID: Yz3wUxW2a3raVbfJ@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Oct 1, 2022 at 06:59:26AM +0530, Bharath Rupireddy wrote:
> > I have always felt this has to be done at the server level, meaning when
> > a synchronous_standby_names replica is not responding after a certain
> > timeout, the administrator must be notified by calling a shell command
> > defined in a GUC and all sessions will ignore the replica. This gives a
------------------------------------
> > much more predictable and useful behavior than the one in the patch ---
> > we have discussed this approach many times on the email lists.
>
> IIUC, each walsender serving a sync standby will determine that the
> sync standby isn't responding for a configurable amount of time (less
> than wal_sender_timeout) and calls shell command to notify the admin
> if there are any backends waiting for sync replication in
> SyncRepWaitForLSN(). The shell command then provides the unresponsive
> sync standby name at the bare minimum for the admin to ignore it as
> sync standby/remove it from synchronous_standby_names to continue
> further. This still requires manual intervention which is a problem if
> running postgres server instances at scale. Also, having a new shell

As I highlighted above, by default you notify the administrator that a
sychronous replica is not responding and then ignore it. If it becomes
responsive again, you notify the administrator again and add it back as
a sychronous replica.

> command in any form may pose security risks. I'm not sure at this
> point how this new timeout is going to work alongside
> wal_sender_timeout.

We have archive_command, so I don't see a problem with another shell
command.

> I'm thinking about the possible options that an admin has to get out
> of this situation:
> 1) Removing the standby from synchronous_standby_names.

Yes, see above. We might need a read-only GUC that reports which
sychronous replicas are active. As you can see, there is a lot of API
design required here, but this is the most effective approach.

> 2) Fixing the sync standby, by restarting or restoring the lost part
> (such as network or some other).
>
> (1) is something that postgres can help admins get out of the problem
> easily and automatically without any intervention. (2) is something
> postgres can't do much about.
>
> How about we let postgres automatically remove an unresponsive (for a
> pre-configured time) sync standby from synchronous_standby_names and
> inform the user (via log message and via new walsender property and
> pg_stat_replication for monitoring purposes)? The users can then
> detect such standbys and later try to bring them back to the sync
> standbys group or do other things. I believe that a production level
> postgres HA with sync standbys will have monitoring to detect the
> replication lag, failover decision etc via monitoring
> pg_stat_replication. With this approach, a bit more monitoring is
> needed. This solution requires less or no manual intervention and
> scales well. Please note that I haven't studied the possibilities of
> implementing it yet.
>
> Thoughts?

Yes, see above.

> > Once we have that, we can consider removing the cancel ability while
> > waiting for synchronous replicas (since we have the timeout) or make it
> > optional. We can also consider how do notify the administrator during
> > query cancel (if we allow it), backend abrupt exit/crash, and
>
> Yeah. If we have the
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> the users can then choose to disable processing query cancels/proc
> dies while waiting for sync replication in SyncRepWaitForLSN().

Yes. We might also change things so a query cancel that happens during
sychronous replica waiting can only be done by an administrator, not the
session owner. Again, lots of design needed here.

> > if we
> > should allow users to specify a retry interval to resynchronize the
> > synchronous replicas.
>
> This is another interesting thing to consider if we were to make the
> auto-removed (by the above approach) standby a sync standby again
> without manual intervention.

Yes, see above. You are addressing the right questions here. :-)

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com

Indecision is a decision. Inaction is an action. Mark Batterson

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-10-05 21:10:22 Re: meson PGXS compatibility
Previous Message Tom Lane 2022-10-05 20:58:46 Re: meson PGXS compatibility