|From:||Bruce Momjian <bruce(at)momjian(dot)us>|
|To:||Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>|
|Cc:||Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>|
|Subject:||Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication|
On Sat, Oct 1, 2022 at 06:59:26AM +0530, Bharath Rupireddy wrote:
> > I have always felt this has to be done at the server level, meaning when
> > a synchronous_standby_names replica is not responding after a certain
> > timeout, the administrator must be notified by calling a shell command
> > defined in a GUC and all sessions will ignore the replica. This gives a
> > much more predictable and useful behavior than the one in the patch ---
> > we have discussed this approach many times on the email lists.
> IIUC, each walsender serving a sync standby will determine that the
> sync standby isn't responding for a configurable amount of time (less
> than wal_sender_timeout) and calls shell command to notify the admin
> if there are any backends waiting for sync replication in
> SyncRepWaitForLSN(). The shell command then provides the unresponsive
> sync standby name at the bare minimum for the admin to ignore it as
> sync standby/remove it from synchronous_standby_names to continue
> further. This still requires manual intervention which is a problem if
> running postgres server instances at scale. Also, having a new shell
As I highlighted above, by default you notify the administrator that a
synchronous replica is not responding and then ignore it. If it becomes
responsive again, you notify the administrator again and add it back as
a synchronous replica.
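To make the notify-then-ignore behavior concrete, here is a minimal sketch of it as a toy state tracker. It is not PostgreSQL code: the class name, the `timeout` parameter, and the `notify` callback (standing in for a shell command configured through a GUC) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class SyncStandbyTracker:
    """Toy model of the notify-then-ignore policy; all names are hypothetical."""
    timeout: float                       # assumed timeout, in seconds
    notify: Callable[[str, str], None]   # stand-in for the admin shell command
    last_reply: Dict[str, float] = field(default_factory=dict)
    ignored: Set[str] = field(default_factory=set)

    def record_reply(self, name: str, now: float) -> None:
        """A standby replied; if it was being ignored, notify and add it back."""
        self.last_reply[name] = now
        if name in self.ignored:
            self.ignored.discard(name)
            self.notify(name, "restored")

    def check(self, now: float) -> None:
        """Ignore (and report) any standby silent for longer than the timeout."""
        for name, seen in self.last_reply.items():
            if name not in self.ignored and now - seen > self.timeout:
                self.ignored.add(name)
                self.notify(name, "ignored")

    def active(self) -> List[str]:
        """Standbys that sessions would still wait for."""
        return sorted(n for n in self.last_reply if n not in self.ignored)


events = []
tracker = SyncStandbyTracker(timeout=10.0,
                             notify=lambda name, event: events.append((name, event)))
tracker.record_reply("standby1", now=0.0)
tracker.check(now=5.0)                      # still within the timeout
tracker.check(now=11.0)                     # silent too long: notify and ignore
tracker.record_reply("standby1", now=12.0)  # responsive again: notify and re-add
print(events)
```

The point of the sketch is that the decision is made once, at the server level, and every session sees the same set of active standbys, rather than each backend timing out on its own.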
> command in any form may pose security risks. I'm not sure at this
> point how this new timeout is going to work alongside
We have archive_command, so I don't see a problem with another shell
command.
> I'm thinking about the possible options that an admin has to get out
> of this situation:
> 1) Removing the standby from synchronous_standby_names.
Yes, see above. We might need a read-only GUC that reports which
synchronous replicas are active. As you can see, there is a lot of API
design required here, but this is the most effective approach.
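As a rough illustration of what such a read-only setting could report, the sketch below derives its value from the configured standby list and the set of standbys currently being ignored. The function name and the comma-separated output format are assumptions made for this example; no such GUC exists today.

```python
from typing import Iterable, Set


def active_sync_standbys(configured: Iterable[str], ignored: Set[str]) -> str:
    """Render a hypothetical read-only setting: configured standbys minus ignored ones."""
    return ", ".join(name for name in configured if name not in ignored)


# Hypothetical standby names; "s2" has timed out and is being ignored.
value = active_sync_standbys(["s1", "s2", "s3"], {"s2"})
print(value)
```

Keeping the configured list untouched and reporting the active subset separately would let a standby be added back automatically without anyone editing synchronous_standby_names.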
> 2) Fixing the sync standby, by restarting or restoring the lost part
> (such as network or some other).
> (1) is something that postgres can help admins get out of the problem
> easily and automatically without any intervention. (2) is something
> postgres can't do much about.
> How about we let postgres automatically remove an unresponsive (for a
> pre-configured time) sync standby from synchronous_standby_names and
> inform the user (via log message and via new walsender property and
> pg_stat_replication for monitoring purposes)? The users can then
> detect such standbys and later try to bring them back to the sync
> standbys group or do other things. I believe that a production level
> postgres HA with sync standbys will have monitoring to detect the
> replication lag, failover decision etc via monitoring
> pg_stat_replication. With this approach, a bit more monitoring is
> needed. This solution requires less or no manual intervention and
> scales well. Please note that I haven't studied the possibilities of
> implementing it yet.
Yes, see above.
> > Once we have that, we can consider removing the cancel ability while
> > waiting for synchronous replicas (since we have the timeout) or make it
> > optional. We can also consider how to notify the administrator during
> > query cancel (if we allow it), backend abrupt exit/crash, and
> Yeah. If we have the
> timeout-and-auto-removal-of-standby-from-sync-standbys-list solution,
> the users can then choose to disable processing query cancels/proc
> dies while waiting for sync replication in SyncRepWaitForLSN().
Yes. We might also change things so a query cancel that happens during
synchronous replica waiting can only be done by an administrator, not the
session owner. Again, lots of design needed here.
> > if we
> > should allow users to specify a retry interval to resynchronize the
> > synchronous replicas.
> This is another interesting thing to consider if we were to make the
> auto-removed (by the above approach) standby a sync standby again
> without manual intervention.
Yes, see above. You are addressing the right questions here. :-)
Indecision is a decision. Inaction is an action. Mark Batterson