Re: Synchronous commit behavior during network outage

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: SATYANARAYANA NARLAPURAM <satyanarlapuram(at)gmail(dot)com>, Ondřej Žižka <ondrej(dot)zizka(at)stratox(dot)cz>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Synchronous commit behavior during network outage
Date: 2021-07-02 06:39:39
Message-ID: 4B0CD464-74FA-4030-B8CC-30881D97A799@yandex-team.ru
Lists: pgsql-hackers

> On 2 July 2021, at 10:59, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2021-06-30 at 17:28 +0500, Andrey Borodin wrote:
>>> My patch also covers the backend termination case. Is there a
>>> reason
>>> you left that case out?
>>
>> Yes, backend termination is used by HA tool before rewinding the
>> node.
>
> Can't you just disable sync rep first (using ALTER SYSTEM SET
> synchronous_standby_names=''), which will unstick the backend, and then
> terminate it?
If the failover happens because the node is unresponsive, we cannot simply turn off sync rep. That requires spare connections (the number of stuck backends will skyrocket during a network partition), available file descriptors and some memory to fork a new backend, and a config reload. And, after all, we need time to try.
In some failures we may lack some of these.
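
To make the cost concrete, the sequence suggested above could be sketched roughly as follows. This is only an illustration in Python: the function name and the list of stuck PIDs are hypothetical and not taken from any real HA tool. The SQL itself relies only on ALTER SYSTEM, pg_reload_conf() and pg_terminate_backend(). Every statement has to be sent over a connection that is still available, which is exactly what may be missing.

```python
def unstick_and_terminate_statements(stuck_pids):
    """Return, in order, the SQL an HA tool would have to run.

    Hypothetical helper for illustration; it only builds the statements
    and does not connect to any server.
    """
    statements = [
        # Step 1: disable synchronous replication so waiting backends unstick.
        "ALTER SYSTEM SET synchronous_standby_names = ''",
        # Step 2: ask the server to re-read its configuration.
        "SELECT pg_reload_conf()",
    ]
    # Step 3: terminate each backend that was waiting for the standby.
    for pid in stuck_pids:
        statements.append(f"SELECT pg_terminate_backend({int(pid)})")
    return statements


if __name__ == "__main__":
    # 12345 is a made-up backend PID.
    for sql in unstick_and_terminate_statements([12345]):
        print(sql + ";")
```

Each of the three steps needs a working session, a successful config reload, and time, so the sketch is only as reliable as the node it runs against.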

Partial degradation is already a hard task. Without the ability to easily terminate a running Postgres, an HA tool will often resort to SIGKILL.

>
> If you don't handle the termination case, then there's still a chance
> for the transaction to become visible to other clients before it's
> replicated.
Termination is an admin command; admins know what they are doing.
Cancellation is part of the user protocol.

BTW, can we have two GUCs, so that HA tool developers can decide for themselves which guarantees they provide?

>
>> There is one more caveat we need to fix: we should prevent instant
>> recovery from happening.
>
> That can already be done with the restart_after_crash GUC.

Oh, I didn't know about it; we will use it. Thanks!

Best regards, Andrey Borodin.
