Synchronous commit behavior during network outage

From: Ondřej Žižka <ondrej(dot)zizka(at)stratox(dot)cz>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Synchronous commit behavior during network outage
Date: 2021-04-19 17:19:37
Message-ID: cac4b9df-92c6-77aa-687b-18b86cb13728@stratox.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello all,
I would like to know your opinion on the following behaviour I see for
PostgreSQL setup with synchronous replication.

This behaviour happens in a special use case. In this use case, there
are 2 synchronous replicas with the following config (truncated):

- 2 nodes
- synchronous_standby_names='*'
- synchronous_commit=remote_apply

With this setup run the following steps (LAN down - LAN between master
and replica):
-----------------
postgres=# truncate table a;
TRUNCATE TABLE
postgres=# insert into a values (1); -- LAN up, insert has been applied
to replica.
INSERT 0 1
Vypnu LAN na serveru se standby:
postgres=# insert into a values (2); --LAN down, waiting for a
confirmation from sync replica. In this situation cancel it (press CTRL+C)
^CCancel request sent
*WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not
have been replicated to the standby.
*INSERT 0 1
There will be warning that commit was performed only locally:
2021-04-12 19:55:53.063 CEST [26104] WARNING:  canceling wait for
synchronous replication due to user request
2021-04-12 19:55:53.063 CEST [26104] DETAIL:  The transaction has
already committed locally, but might not have been replicated to the
standby.

postgres=# insert into a values (2); --LAN down, waiting for a
confirmation from sync replica. In this situation cancel it (press CTRL+C)
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not
have been replicated to the standby.
INSERT 0 1
postgres=# insert into a values (2); --LAN down, waiting for sync
replica, second attempt, cancel it as well (CTRL+C)
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not
have been replicated to the standby.
INSERT 0 1
postgres=# update a set n=3 where n=2; --LAN down, waiting for sync
replica, cancel it (CTRL+C)
^CCancel request sent
WARNING:  canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not
have been replicated to the standby.
UPDATE 2
postgres=# update a set n=3 where n=2; -- run the same update,because
data from the previous attempt was commited on master, it is sucessfull,
but no changes
UPDATE 0
postgres=# select * from a;
 n
---
 1
 3
 3
(3 rows)
postgres=#
------------------------

Now, there is only value 1 in the sync replica table (no other values),
data is not in sync. This is expected, after the LAN restore, data will
come sync again, but if the main/primary node will fail and we failover
to replica before the LAN is back up or the storage for this node would
be destroyed and data would not sync to replica before it, we will lose
data even if the client received successful commit (with a warning).
From the synchronous_commit=remote_write level and "higher", I would
expect, that when the remote application (doesn't matter if flush, write
or apply) would not be applied I would not receive a confirmation about
the commit (even with a warning). Something like, if there is no commit
from sync replica, there is no commit on primary and if someone performs
the steps above, the whole transaction will not send a confirmation.

This can cause issues if the application receives a confirmation about
the success and performs some follow-up steps e.g. create a user account
and sends a request to the mail system to create an account or create a
VPN account. If the scenario above happens, there can exist a VPN
account that does not have any presence in the central database and can
be a security issue.

I hope I explained it sufficiently. :-)

Do you think, that would be possible to implement a process that would
solve this use case?

Thank you
Ondrej

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2021-04-19 17:22:20 Re: Commit 86dc90056 - Rework planning and execution of UPDATE and DELETE
Previous Message Tom Lane 2021-04-19 17:03:56 Re: Commit 86dc90056 - Rework planning and execution of UPDATE and DELETE