Re: Synchronous commit behavior during network outage

From: Ondřej Žižka <ondrej(dot)zizka(at)stratox(dot)cz>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Synchronous commit behavior during network outage
Date: 2021-04-20 18:05:31
Message-ID: 39d141cf-89c0-6aaa-c8e9-500c05d0d744@stratox.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I am sorry, I forgot mentioned, that in the second situation I added a
primary key to the table.

Ondrej

On 20/04/2021 18:49, Ondřej Žižka wrote:
> Hello Aleksander,
>
> Thank you for the reaction. This was tested on version 13.2.
>
> There are also other possible situations with the same setup and
> similar issue:
>
> -----------------
> When the background process on server fails....
>
> On postgresql1:
> tecmint=# select * from a; --> LAN on sync replica is OK
>  id
> ----
>   1
> (1 row)
>
> tecmint=# insert into a values (2); ---> LAN on sync replica is DOWN
> and insert is waiting. During this time kill the background process on
> the PostgreSQL server for this session
> WARNING:  canceling the wait for synchronous replication and
> terminating connection due to administrator command
> DETAIL:  The transaction has already committed locally, but might not
> have been replicated to the standby.
> server closed the connection unexpectedly
>     This probably means the server terminated abnormally
>     before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
> (2 rows)
>
> tecmint=# ---> LAN on sync replica is still DOWN
>
> The potgres session will restore after the background process failed.
> When you run select on master, it still looks OK. But data is still
> not replicated on the sync replica. If we lost the master now, we
> would lost this data as well.
>
> **************
> Another case
> **************
>
> Kill the client process.
>
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
> (3 rows)
> tecmint=#                --> Disconnect the sync replica now. LAN on
> replica is DOWN
> tecmint=# insert into a values (4); --> Kill the client process
> Terminated
> xzizka(at)service-vm:~$ psql -U postgres -h 192.168.122.6 -p 5432 -d tecmint
> Password for user postgres:
> psql (13.2 (Debian 13.2-1.pgdg100+1))
> Type "help" for help.
>
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
> (3 rows)
>
> tecmint=# --> Number 4 is not there. Now switch the LAN on sync
> replica ON.
>
> ----------
>
> Result from sync replica after the LAN is again UP:
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
>   4
> (4 rows)
>
>
> In this situation, try to insert the number 4 again to the table.
>
> tecmint=# select * from a;
>  id
> ----
>   1
>   2
>   3
> (3 rows)
>
> tecmint=# insert into a values (4);
> ERROR:  duplicate key value violates unique constraint "a_pkey"
> DETAIL:  Key (id)=(4) already exists.
> tecmint=#
>
> This is really strange... Application can be confused, It is not
> possible to insert record, which is not there, but some systems which
> use the sync node as a read replica maybe already read that record
> from the sync replica database and done some steps which can cause
> issues and can be hard to track.
>
> If I say, that it would be hard to send the CTRL+C to the database
> from the client, I need to say, that the 2 situations I described here
> can happen in real.
>
> What do you think?
>
> Thank you and regards
> Ondrej
>
> On 20/04/2021 17:23, Aleksander Alekseev wrote:
>> Hi Ondřej,
>>
>> Thanks for the report. It seems to be a clear violation of what is
>> promised in the docs. Although it's unlikely that someone implemented
>> an application which deals with important data and "pressed Ctr+C" as
>> it's done in psql. So this might be not such a critical issue after
>> all. BTW what version of PostgreSQL are you using?
>>
>>
>> On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka
>> <ondrej(dot)zizka(at)stratox(dot)cz> wrote:
>>> Hello all,
>>> I would like to know your opinion on the following behaviour I see
>>> for PostgreSQL setup with synchronous replication.
>>>
>>> This behaviour happens in a special use case. In this use case,
>>> there are 2 synchronous replicas with the following config (truncated):
>>>
>>> - 2 nodes
>>> - synchronous_standby_names='*'
>>> - synchronous_commit=remote_apply
>>>
>>>
>>> With this setup run the following steps (LAN down - LAN between
>>> master and replica):
>>> -----------------
>>> postgres=# truncate table a;
>>> TRUNCATE TABLE
>>> postgres=# insert into a values (1); -- LAN up, insert has been
>>> applied to replica.
>>> INSERT 0 1
>>> Vypnu LAN na serveru se standby:
>>> postgres=# insert into a values (2); --LAN down, waiting for a
>>> confirmation from sync replica. In this situation cancel it (press
>>> CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user
>>> request
>>> DETAIL:  The transaction has already committed locally, but might
>>> not have been replicated to the standby.
>>> INSERT 0 1
>>> There will be warning that commit was performed only locally:
>>> 2021-04-12 19:55:53.063 CEST [26104] WARNING:  canceling wait for
>>> synchronous replication due to user request
>>> 2021-04-12 19:55:53.063 CEST [26104] DETAIL:  The transaction has
>>> already committed locally, but might not have been replicated to the
>>> standby.
>>>
>>> postgres=# insert into a values (2); --LAN down, waiting for a
>>> confirmation from sync replica. In this situation cancel it (press
>>> CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user
>>> request
>>> DETAIL:  The transaction has already committed locally, but might
>>> not have been replicated to the standby.
>>> INSERT 0 1
>>> postgres=# insert into a values (2); --LAN down, waiting for sync
>>> replica, second attempt, cancel it as well (CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user
>>> request
>>> DETAIL:  The transaction has already committed locally, but might
>>> not have been replicated to the standby.
>>> INSERT 0 1
>>> postgres=# update a set n=3 where n=2; --LAN down, waiting for sync
>>> replica, cancel it (CTRL+C)
>>> ^CCancel request sent
>>> WARNING:  canceling wait for synchronous replication due to user
>>> request
>>> DETAIL:  The transaction has already committed locally, but might
>>> not have been replicated to the standby.
>>> UPDATE 2
>>> postgres=# update a set n=3 where n=2; -- run the same
>>> update,because data from the previous attempt was commited on
>>> master, it is sucessfull, but no changes
>>> UPDATE 0
>>> postgres=# select * from a;
>>>   n
>>> ---
>>>   1
>>>   3
>>>   3
>>> (3 rows)
>>> postgres=#
>>> ------------------------
>>>
>>> Now, there is only value 1 in the sync replica table (no other
>>> values), data is not in sync. This is expected, after the LAN
>>> restore, data will come sync again, but if the main/primary node
>>> will fail and we failover to replica before the LAN is back up or
>>> the storage for this node would be destroyed and data would not sync
>>> to replica before it, we will lose data even if the client received
>>> successful commit (with a warning).
>>>  From the synchronous_commit=remote_write level and "higher", I
>>> would expect, that when the remote application (doesn't matter if
>>> flush, write or apply) would not be applied I would not receive a
>>> confirmation about the commit (even with a warning). Something like,
>>> if there is no commit from sync replica, there is no commit on
>>> primary and if someone performs the steps above, the whole
>>> transaction will not send a confirmation.
>>>
>>> This can cause issues if the application receives a confirmation
>>> about the success and performs some follow-up steps e.g. create a
>>> user account and sends a request to the mail system to create an
>>> account or create a VPN account. If the scenario above happens,
>>> there can exist a VPN account that does not have any presence in the
>>> central database and can be a security issue.
>>>
>>> I hope I explained it sufficiently. :-)
>>>
>>> Do you think, that would be possible to implement a process that
>>> would solve this use case?
>>>
>>> Thank you
>>> Ondrej
>>
>>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey Borodin 2021-04-20 18:18:12 Re: ML-based indexing ("The Case for Learned Index Structures", a paper from Google)
Previous Message Ondřej Žižka 2021-04-20 18:00:22 Re: Synchronous commit behavior during network outage