Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>
Subject: Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Date: 2022-05-09 09:20:21
Message-ID: CALj2ACWoB3k0kfjA7JxJgskVGGeE3jWzmGRPjP9QTRSCgSjhOg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 26, 2022 at 11:57 AM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> wrote:
>
> On Mon, 2022-04-25 at 19:51 +0530, Bharath Rupireddy wrote:
> > With synchronous replication typically all the transactions (txns)
> > first locally get committed, then streamed to the sync standbys and
> > the backend that generated the transaction will wait for ack from sync
> > standbys. While waiting for ack, it may happen that the query or the
> > txn gets canceled (QueryCancelPending is true) or the waiting backend
> > is asked to exit (ProcDiePending is true). In either of these cases,
> > the wait for ack gets canceled and leaves the txn in an inconsistent
> > state [...]
> >
> > Here's a proposal (mentioned previously by Satya [1]) to avoid the
> > above problems:
> > 1) Wait a configurable amount of time before canceling the sync
> > replication by the backends i.e. delay processing of
> > QueryCancelPending and ProcDiePending in Introduced a new timeout GUC
> > synchronous_replication_naptime_before_cancel, when set, it will let
> > the backends wait for the ack before canceling the synchronous
> > replication so that the transaction can be available in sync standbys
> > as well.
> > 2) Wait for sync standbys to catch up upon restart after the crash or
> > in the next txn after the old locally committed txn was canceled.
>
> While this may mitigate the problem, I don't think it will deal with
> all the cases which could cause a transaction to end up committed locally,
> but not on the synchronous standby. I think that only using the full
> power of two-phase commit can make this bulletproof.

Not sure if it's recommended to use 2PC in postgres HA with sync
replication where the documentation says that "PREPARE TRANSACTION"
and other 2PC commands are "intended for use by external transaction
management systems" and with explicit transactions. Whereas, the txns
within a postgres HA with sync replication always don't have to be
explicit txns. Am I missing something here?

> Is it worth adding additional complexity that is not a complete solution?

The proposed approach helps to avoid some common possible problems
that arise with simple scenarios (like cancelling a long running query
while in SyncRepWaitForLSN) within sync replication.

[1] https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

Regards,
Bharath Rupireddy.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2022-05-09 09:44:03 Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
Previous Message Alvaro Herrera 2022-05-09 09:05:12 Re: Support logical replication of DDLs