Re: Synchronous commit behavior during network outage

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Ondřej Žižka <ondrej(dot)zizka(at)stratox(dot)cz>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Synchronous commit behavior during network outage
Date: 2021-06-28 22:56:29
Message-ID: 6a052e81060824a8286148b1165bafedbd7c86cd.camel@j-davis.com

On Tue, 2021-04-20 at 14:19 -0700, SATYANARAYANA NARLAPURAM wrote:
> One idea here is to make the backend ignore query
> cancellation/backend termination while waiting for the synchronous
> commit ACK. This way client never reads the data that was never
> flushed remotely. The problem with this approach is that your
> backends get stuck until your commit log record is flushed on the
> remote side. Also, the client can see the data not flushed remotely
> if the server crashes and comes back online. You can prevent the
> latter case by making a SyncRepWaitForLSN before opening up the
> connections to the non-superusers. I have a working prototype of this
> logic, if there is enough interest I can post the patch.

I didn't see a patch here yet, so I wrote a simple one for
consideration (attached).

The problem exists for both cancellation and termination requests. The
patch adds a GUC that makes SyncRepWaitForLSN keep waiting. It does not
ignore the requests; it defers them. For instance, a termination request
will still be honored once the backend is done waiting for sync rep.
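
To see which backends are sitting in that wait, a query along these lines
may help. It is only a sketch and not part of the patch; it assumes the
wait is reported in pg_stat_activity under the SyncRep wait event, as it
is in stock PostgreSQL, and the xact_age column alias is made up for
illustration.

```sql
-- Backends currently blocked until a synchronous standby confirms the commit.
-- Assumes the wait is reported as wait_event = 'SyncRep'.
SELECT pid,
       usename,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start AS xact_age   -- illustrative alias
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';
```

With the GUC enabled, a backend that has received a cancel or terminate
request would be expected to keep showing up here until the standby
responds or sync rep is disabled.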

The idea of this GUC is not to wait forever (obviously), but to allow
the administrator (or an automated network agent) to be in control of
the logic:

If the primary is non-responsive, the administrator can decide to fail
over, knowing that all visible transactions on the primary are durable
on the standby (because any transaction that didn't make it to the
standby also didn't release locks yet). If the standby is non-
responsive, the administrator can intervene with something like:

ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();

which will disable sync rep, allowing the primary to complete the query
and continue on without the standby; but in that case the admin must be
sure not to fail over until there's a new standby that is fully caught up.
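
One way to check whether a replacement standby has caught up, and then
turn sync rep back on, might look like the following. This is a hedged
sketch rather than a prescribed procedure: 'standby2' is a hypothetical
application_name, and what counts as "caught up" is a judgment call for
the administrator.

```sql
-- On the primary: how far behind is each connected standby?
SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes
FROM pg_stat_replication;

-- When the new standby is streaming and its flush lag is negligible,
-- re-enable sync rep. 'standby2' is a hypothetical standby name.
ALTER SYSTEM SET synchronous_standby_names = 'standby2';
SELECT pg_reload_conf();
```

After the reload, sync_state for that standby should change to sync, and
from then on commits wait for it again, so failing over becomes safe once
more.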

The patch may be somewhat controversial, so I'll wait for feedback
before documenting it properly.

Regards,
Jeff Davis

Attachment: sync-wait.diff (text/x-patch, 4.5 KB)
