Re: Standalone synchronous master

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Alexander Björnhagen <alex(dot)bjornhagen(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Standalone synchronous master
Date: 2011-12-26 13:35:58
Message-ID: CABUevExC-ySt9-64Dak=wmMM2pqXVyb7fu_iGxb_0Eo5nBTyRw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 26, 2011 at 13:51, Alexander Björnhagen
<alex(dot)bjornhagen(at)gmail(dot)com> wrote:
> Hello, and thank you for your feedback; I appreciate it.
>
> Updated patch : sync-standalone-v2.patch
>
> I am sorry to be spamming the list, but I just cleaned it up a little
> bit, wrote better comments, and tried to move most of the logic into
> syncrep.c, since that's where it belongs anyway. I also fixed a small
> bug where standalone mode was disabled/enabled at runtime via SIGHUP.

It's not spam when it's an updated patch ;)

>> Basically I like this whole idea, but I'd like to know why you think this functionality is required?
>
> How should a synchronous master handle the situation where all
> standbys have failed?
>
> Well, I think this is one of those cases where you could argue either
> way. Someone caring more about high availability of the system will
> want to let the master continue and just raise an alert to the
> operators. Someone looking for an absolute guarantee of data
> replication will say otherwise.

If you don't care about the absolute guarantee of data, why not just
use async replication? It's still going to replicate the data over to
the standby as quickly as it can - which in the end is the same level
of guarantee that you get with this switch set, isn't it?

>> When is the replication mode switched from "standalone" to "sync"?
>
> Good question. Currently that happens when a standby server has
> connected and also been deemed suitable for synchronous commit by the
> master (meaning that its name matches the config variable
> synchronous_standby_names). So in a setup with both synchronous and
> asynchronous standbys, the master only considers the synchronous ones
> when deciding on standalone mode. The asynchronous standbys are
> “useless” to a synchronous master anyway.

But wouldn't an async standby still be a lot better than no standby at
all (standalone)?
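
As a rough illustration of the rule described in the quoted text (this is
not code from the patch; the function and parameter names are hypothetical,
and exact name matching is a simplifying assumption), the mode decision
could be sketched as:

```python
def replication_mode(connected, sync_names, standalone_enabled):
    """Return 'sync', 'standalone', or 'blocked' for a master.

    connected:          names of currently connected standbys (hypothetical input)
    sync_names:         names listed in synchronous_standby_names
    standalone_enabled: the patch's proposed synchronous_standalone_master
    """
    # Only standbys whose name matches synchronous_standby_names count;
    # asynchronous standbys are ignored for this decision.
    if any(name in sync_names for name in connected):
        return "sync"
    # No synchronous standby is connected: continue alone or wait.
    return "standalone" if standalone_enabled else "blocked"
```

Note that in this sketch a connected async standby does not keep the master
out of standalone mode, which is the behaviour being questioned here. The
sketch also ignores whether the standby has caught up yet.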

>> The former might block the transactions for a long time until the standby has caught up with the master even though synchronous_standalone_master is enabled and a user wants to avoid such a downtime.
>
> If we are talking about a network “glitch”, then the standby would take
> a few seconds/minutes to catch up (not hours!) which is acceptable if
> you ask me.

So it's not OK to block the master when the standby goes away, but it
is OK to block it when it comes back and catches up? The outage itself
might last the same amount of time - or even less, depending on
exactly how the network behaves.

>> 1. While synchronous replication is running normally, the replication
>>    connection is closed because of a network outage.
>> 2. The master works standalone because of
>>    synchronous_standalone_master=on, and some new transactions are
>>    committed though their WAL records are not replicated to the standby.
>> 3. The master crashes for some reason; the clusterware detects it and
>>    triggers a failover.
>> 4. The standby, which doesn't have the recently committed transactions,
>>    becomes the master at failover...
>
>> Is this scenario acceptable?
>
> So you have two separate failures in less time than an admin would
> have time to react and manually bring up a new standby.

Given that one is a network failure, and one is a node failure, I
don't see that being strange at all. For example, an HA network
environment might cause a short glitch when it's failing over to a
redundant node - enough to bring down the replication connection and
require it to reconnect (during which the master would be ahead of the
slave).

In fact, both might well be network failures - one just making the
master completely inaccessible, and thus triggering the need for a
failover.
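
To make the four-step scenario concrete, here is a toy simulation. It is
only a sketch under simplifying assumptions: WAL is modelled as a plain
list, the names are hypothetical, and nothing here reflects how the patch
or PostgreSQL is actually implemented.

```python
def run_scenario(standalone_enabled):
    """Return transactions acknowledged to clients but missing after failover."""
    master_wal, standby_wal, acked = [], [], []

    def commit(txn, link_up):
        if link_up:
            # Normal sync replication: ack only once the standby has the record.
            master_wal.append(txn)
            standby_wal.append(txn)
            acked.append(txn)
        elif standalone_enabled:
            # Step 2: the master continues alone and acks without the standby.
            master_wal.append(txn)
            acked.append(txn)
        # Otherwise the commit waits and is never acknowledged to the client.

    commit("t1", link_up=True)   # before the outage
    commit("t2", link_up=False)  # steps 1-2: replication connection is down
    # Steps 3-4: the master crashes and the standby is promoted with its own WAL.
    return [t for t in acked if t not in standby_wal]
```

With standalone mode on, `run_scenario(True)` reports `t2` as acknowledged
but lost after the failover; with it off, `run_scenario(False)` reports
nothing lost, at the price of the commit having blocked.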

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
