Re: Avoiding data loss with synchronous replication

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding data loss with synchronous replication
Date: 2021-07-24 10:53:15
Message-ID: D46D857F-5465-4688-BD6C-280942D28C39@yandex-team.ru
Lists: pgsql-hackers

> On 23 July 2021, at 22:54, Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
>
> On 7/23/21, 4:33 AM, "Andrey Borodin" <x4mmm(at)yandex-team(dot)ru> wrote:
>> Thanks for your interest in the topic. I think in the thread [0] we almost agreed on the general design.
>> The only remaining question is whether we want to treat pg_ctl stop and kill SIGTERM differently from pg_terminate_backend().
>
> I didn't get the idea that there was a tremendous amount of support
> for the approach to block canceling waits for synchronous replication.
> FWIW this was my initial approach as well, but I've been trying to
> think of alternatives.
>
> If we can gather support for some variation of the block-cancels
> approach, I think that would be preferred over my proposal from a
> complexity standpoint.
Let's clearly enumerate the problems with blocking.
It's been mentioned that the backend is not responsive when cancelation is blocked. On the contrary, it's very responsive.

postgres=# alter system set synchronous_standby_names to 'bogus';
ALTER SYSTEM
postgres=# alter system set synchronous_commit_cancelation TO off ;
ALTER SYSTEM
postgres=# select pg_reload_conf();
2021-07-24 15:35:03.054 +05 [10452] LOG: received SIGHUP, reloading configuration files
 pg_reload_conf
----------------
 t
(1 row)
postgres=# begin;
BEGIN
postgres=*# insert into t1 values(0);
INSERT 0 1
postgres=*# commit ;
^CCancel request sent
WARNING: canceling wait for synchronous replication requested, but cancelation is not allowed
DETAIL: The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.
^CCancel request sent
WARNING: canceling wait for synchronous replication requested, but cancelation is not allowed
DETAIL: The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.

It clearly says what's wrong. If that's still not enough, let's add a hint about synchronous_standby_names.
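
To make the behavior in the session above concrete, here is a minimal sketch of one pass through a wait loop that refuses to cancel. It is a toy model, not the patch and not PostgreSQL source: WaitState, WaitAction and sync_rep_wait_step() are hypothetical names, and the cancelation_allowed flag only stands in for the synchronous_commit_cancelation setting used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical result of one pass through the wait loop. */
typedef enum {
    WAIT_CONTINUE,   /* keep waiting for the synchronous standby */
    WAIT_DONE,       /* the standby confirmed the commit */
    WAIT_CANCELED    /* wait abandoned; the commit exists only locally */
} WaitAction;

/* Hypothetical per-backend state; not a PostgreSQL struct. */
typedef struct {
    bool standby_acked;        /* standby has confirmed the COMMIT record */
    bool cancel_pending;       /* a cancel request (e.g. Ctrl-C) arrived */
    bool cancelation_allowed;  /* models synchronous_commit_cancelation */
    int  warnings_emitted;     /* how many refusals were reported */
} WaitState;

/*
 * Decide what to do on one iteration of the wait. A cancel request is
 * always consumed, so the backend stays responsive, but when cancelation
 * is not allowed it only produces a warning and the wait goes on.
 */
WaitAction sync_rep_wait_step(WaitState *st)
{
    assert(st != NULL);

    if (st->standby_acked)
        return WAIT_DONE;

    if (st->cancel_pending) {
        st->cancel_pending = false;

        if (st->cancelation_allowed)
            return WAIT_CANCELED;

        /* Same wording as the WARNING shown in the session above. */
        fprintf(stderr, "WARNING: canceling wait for synchronous replication "
                        "requested, but cancelation is not allowed\n");
        st->warnings_emitted++;
    }

    return WAIT_CONTINUE;
}
```

With cancelation_allowed set to false, two cancel requests in a row give two warnings and the wait continues both times, which matches the two ^C attempts above. The wait ends only once standby_acked becomes true.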

Are there any other problems with blocking cancels?

> Robert's idea to provide a way to understand
> the intent of the cancellation/termination request [0] could improve
> matters. Perhaps adding an argument to pg_cancel/terminate_backend()
> and using different signals to indicate that we want to cancel the
> wait would be something that folks could get on board with.

The semantics of cancelation assume that the query can be interrupted correctly. That is no longer possible once we have committed locally, so there cannot be any correct cancelation. And I don't think it's worth adding an incorrect one.

Interestingly, converting the transaction to 2PC when the backend is terminated is a neat idea. It gives stronger guarantees that the transaction will commit correctly even after a restart. But we may run short of max_prepared_xacts slots...
Anyway, backend termination bothers me a lot less than cancelation: drivers do not terminate queries on their own, but they do cancel queries by default.

Thanks!

Best regards, Andrey Borodin.
