From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Some thoughts about multi-server sync rep configurations
Date: 2016-12-28 00:14:40
Message-ID: CAEepm=1GNCriNvWhPkWCqrsbXWGtWEEpvA-KnovMbht5ryzbmg@mail.gmail.com
Lists: pgsql-hackers

Hi,

Sync rep with multiple standbys allows queries run on standbys to see
transactions that haven't yet been flushed on the configured number of
standbys. That makes it susceptible to lost updates or a kind of
"dirty read" in certain cluster reconfiguration scenarios. To close
that gap, we would need to introduce extra communication so that, in
certain circumstances, standbys wait for flushes on other servers
before a snapshot can be used. That doesn't sound like a popular
performance feature, and I don't have a concrete proposal for it, but
I wanted to raise the topic for discussion to see if others have
thought about it.
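
To make that concrete, here's a rough sketch of the rule a standby
would have to enforce before handing out a snapshot. The function,
the integer LSNs and the way a standby would learn other servers'
flush positions are all invented for illustration:

# Invented sketch: a standby may only let snapshots see commits up to
# the point that at least N standbys (counting itself) are known to
# have flushed.  Learning "flush_lsns" would require the extra
# communication described above, which doesn't exist today.

def visibility_horizon(flush_lsns, n):
    """Return the highest LSN that at least n standbys have flushed;
    snapshots must not see commits beyond this point."""
    assert len(flush_lsns) >= n
    # The n-th highest flushed LSN is covered by exactly n standbys.
    return sorted(flush_lsns, reverse=True)[n - 1]

# Example: three standbys, N = 2.  Even though this standby has
# flushed to 150, only commits up to 120 are safe to show.
print(visibility_horizon([150, 120, 80], n=2))   # -> 120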

I speculate that there are three things that need to be aligned for
multi-server sync rep to be able to make complete guarantees about
durability, and pass the kind of testing that someone like Aphyr of
Jepsen[1] would throw at it. Of course you might argue that the
guarantees about reconfiguration that I'm assuming in the following
are not explicitly made anywhere, and I'd be very interested to hear
what others have to say about them. Suppose you have K servers and
you decide that you want to be able to lose N servers without data
loss. Then as far as I can see the three things are:

1. While automatic replication cluster management is not part of
Postgres, your manual or automatic reconfiguration procedure must have
two important properties in order for synchronous replication to
survive reconfiguration without data loss. At cluster reconfiguration
time on the loss of the primary, you must be able to contact at least
K - N servers, and of those you must promote the one with the highest
LSN; otherwise there is no way to know that the latest successfully
committed transaction is present in the new timeline. Furthermore, you
must be able to contact more than K / 2 servers (a majority) to avoid
split-brain syndrome. (There's a sketch of these checks after the list
below.)

2. The primary must wait for N standby servers to acknowledge
flushing before returning from commit, as we do.

3. No server may allow a transaction to become visible until it has
been flushed on N standby servers. We already prevent that on the
primary, but not on standbys. You might see a transaction on a given
standby, then lose that standby and the primary, and then a new
primary might be elected that doesn't have that transaction. We don't
have this problem if you only run queries on the primary, and we don't
have it in single-standby configurations, ie K = 2 and N = 1. But as
soon as K > 2 and N > 1, we can have the problem on standbys.
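
To illustrate the checks in point 1, here's a rough sketch of the
decision the reconfiguration procedure has to make. The server names
and integer LSNs are invented, and a real procedure would compare
proper WAL positions:

# Invented sketch of the promotion decision: only proceed if enough
# servers can be contacted, then pick the most advanced one.
def choose_promotion_target(reachable, k, n):
    """reachable maps server name -> highest flushed LSN; k is the
    total number of servers, n the number we can afford to lose."""
    if len(reachable) < k - n:
        raise Exception("can't rule out a missing committed transaction")
    if len(reachable) <= k // 2:
        raise Exception("no majority, refusing to avoid split-brain")
    # Every acknowledged commit was flushed on at least n + 1 servers,
    # so any k - n servers include at least one that has it: promote
    # the most advanced survivor.
    return max(reachable, key=reachable.get)

# Example with the six-server cluster from the scenario below, N = 3,
# after losing both London servers:
survivors = {"nyc1": 210, "nyc2": 230, "tokyo1": 225, "tokyo2": 190}
print(choose_promotion_target(survivors, k=6, n=3))   # -> "nyc2"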

Example:

You have 2 servers in London, 2 in New York and 2 in Tokyo. You
enable synchronous replication with N = 3. A transaction updates a
record and commits locally on host "london1", and begins waiting for 3
servers to respond. A network fault prevents messages from London
reaching the other two data centres, because the rack is on fire. But
"london2" receives and applies the WAL. Now another session sees this
transaction on "london2" and reports a fact derived from it to a
customer. Finally failover software or humans determine that it's
time to promote a new primary server in Tokyo. The fact reported to
the customer has evaporated; that was a kind of "dirty read" that
might have consequences for your business.
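
Under the kind of rule sketched earlier, "london2" would have had to
hold that snapshot back, because it could only account for one flushed
copy on a standby when three were required (numbers invented again):

# The commit is at (invented) LSN 300; the partition means london2
# only knows about its own flush position.
known_standby_flushes = {"london2": 300}
n = 3
visible = sum(1 for lsn in known_standby_flushes.values()
              if lsn >= 300) >= n
print(visible)   # -> False: don't show the transaction yet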

[1] https://aphyr.com/tags/jepsen

--
Thomas Munro
http://www.enterprisedb.com
