Causal reads take II

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Causal reads take II
Date: 2017-01-03 01:43:50
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

Hi hackers,

Here is a new version of my "causal reads" patch (see the earlier
thread from the 9.6 development cycle[1]), which provides a way to
avoid stale reads when load balancing with streaming replication.

To try it out:

Set up a primary and some standbys, and put "causal_reads_timeout =
4s" in the primary's postgresql.conf. Then SET causal_reads = on and
experiment with various workloads, watching your master's log and
looking at pg_stat_replication. For example you could try out
test-causal-reads.c with --causal-reads --check (from the earlier
thread) or write something similar, and verify the behaviour while
killing, pausing, overwhelming servers etc.

Here's a brief restatement of the problem I'm trying to address and
how this patch works:

In 9.6 we got a new synchronous_commit level "remote_apply", which
causes committing transactions to block until the commit record has
been applied on the current synchronous standby server. In 10devel
can now be servers plural. That's useful because it means that a
client can run tx1 on the primary and then run tx2 on an appropriate
standby, or cause some other client to do so, and be sure that tx2 can
see tx1. Tx2 can be said to be "causally dependent" on tx1 because
clients expect tx2 to see tx1, because they know that tx1 happened
before tx2.

In practice there are complications relating to failure and
transitions. How should you find an appropriate standby? Suppose you
have a primary and N standbys, you set synchronous_standby_names to
wait for all N standbys, and you set synchronous_commit to
remote_apply. Then the above guarantee of visibility of tx1 by tx2
works, no matter which server you run tx2 on. Unfortunately, if one
of your standby servers fails or there is a network partition, all
commits will block until you fix that. So you probably want to set
synchronous_standby_names to wait for a subset of your set of
standbys. Now you can lose some number of standby servers without
holding up commits on the primary, but the visibility guarantee for
causal dependencies is lost! How can a client know for certain
whether tx2 run on any given standby can see a transaction tx1 that it
has heard about? If you're using the new "ANY n" mode then the subset
of standbys that have definitely applied tx1 is not known to any
client; if you're using the traditional FIRST mode it's complicated
during transitions (you might be talking to a standby that has
recently lost its link to the primary and the primary could have
decided to wait for the next highest priority standby instead and then
returned from COMMIT successfully).

This patch provides the following guarantee: if causal_reads is on
for both tx1 and tx2, then after tx1 returns, tx2 will either see tx1
or fail with an error indicating that the server is currently
unavailable for causal reads. This guarantee is upheld even if there
is a network partition and the standby running tx2 is unable to
communicate with the primary server, but requires the system clocks of
all standbys to differ from the primary's by less than a certain
amount of allowable skew that is accounted for in the algorithm
(causal_reads_timeout / 4, see README.causal_reads for gory details).

It works by sending a stream of "leases" to standbys that are applying
fast enough. These leases grant the standby the right to assume that
all transactions that were run with causal_reads = on and have
returned control have already been applied locally, without doing any
communication or waiting, for a limited time. Leases are promises
made by the primary that it will wait for all such transactions to be
applied on each 'available' standby or for available standbys' leases
to be revoked because they're lagging too much, and for any revoked
leases to expire.

As discussed in the earlier thread, there are other ways that tx2 run
on a standby could get a useful guarantee about the visibility of an
early transaction tx1 that the client knows about. (1) User-managed
"causality tokens": Clients could somehow obtain the LSN of commit
tx1 (or later), and then tx2 could explicitly wait for that LSN to be
applied, as proposed by Ivan Kartyshov[2] and others; if you aren't
also using sync rep for data loss avoidance, then tx1 will return from
committing without waiting for standbys, and by the time tx2 starts on
a standby it may find that the LSN has already been applied and not
have to wait at all. That is definitely good. Unfortunately it also
transfers the problem of tracking causal dependencies between
transactions to client code, which is a burden on the application
developer and difficult to retrofit. (2) Middleware-managed
causality tokens: Something like pgpool or pgbouncer or some other
proxy could sit in front of all of your PostgreSQL servers and watch
all transactions and do the LSN tracking for you, inserting waits
where appropriate so that no standby query ever sees a snapshot that
doesn't include any commit that any client has heard about; that
requires tx2 to wait for transactions that may be later than tx1 to be
applied potentially slowing down every read query, and requires
pushing all transactions through a single central process thereby
introducing its own failover problem with associated transition
failure mode that could break our guarantee if somehow two of these
proxies are ever active at once.

Don't get me wrong, I think those are good ideas: let's do those too.
I guess that people working on logical multi-master replication might
eventually want a general concept of causality tokens which could
include some kind of vector clock. But I don't see this proposal as
conflicting with any of that. It's a set of trade-offs that provides
a simple solution for users who want to be able to talk directly to
any PostgreSQL standby server out of the box without pushing
everything through a central observer, and who want to be able to
enable this for existing applications without having to rewrite them
to insert complicated code to track and communicate LSNs.

Some assorted thoughts and things I'd love to hear your ideas on:

I admit that it has a potentially confusing relationship with
synchronous replication. It is presented as a separate feature, and
you can use both features together or use them independently:
synchronous_standby_names and synchronous_commit are for controlling
your data loss risk, and causal_reads_standby_names and causal_reads
are for controlling distributed read consistency. Perhaps the
causal_reads GUC should support different levels rather that using
on/off; the mode described above could be enabled with something =
'causal_read_lease', leaving room for other modes. Maybe the whole
feature needs a better name: I borrowed "causal reads" from Galera's
wsrep_causal_reads/wsrep_sync_wait. That system makes readers (think
standbys) wait for the global end of WAL to be applied locally at the
start of every transaction, which could also be a potential future
mode for us, but I thought it was much more interesting to have
wait-free reads on standbys, especially if you already happen to be
waiting on the primary because you want to avoid data loss with
syncrep. To achieve that I added system-clock-based leases. I
suspect some people will dislike that part: the guarantee includes the
caveat about the maximum difference between system clocks, and the
patch doesn't do anything as clever as Google's Spanner/Truetime
system or come with a free built-in atomic clock, so it relies on
setting the max clock skew conservatively and making sure you have NTP
set up correctly (for example, things reportedly got a bit messy for a
short time after the recent leap second if you happened to have only
one server from in your ntpd.conf and were unlucky). I
considered ways to make causal reads an extension, but it'd need
fairly invasive hooks including the ability to change replication wire
protocol messages.

Long term, I think it would be pretty cool if we could develop a set
of features that give you distributed sequential consistency on top of
streaming replication. Something like (this | causality-tokens) +

The patch:

The replay lag tracking patch this depends on is in the current
commitfest[1] and is presented as an independent useful feature.
Please find two patches to implement causal reads for the open CF
attached. First apply replay-lag-v16.patch, then
refactor-syncrep-exit-v16.patch, then causal-reads-v16.patch.

Thanks for reading!


Thomas Munro

Attachment Content-Type Size
refactor-syncrep-exit-v16.patch application/octet-stream 4.9 KB
causal-reads-v16.patch application/octet-stream 75.8 KB


Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-01-03 02:03:09 Re: What is "index returned tuples in wrong order" for recheck supposed to guard against?
Previous Message Lewis, Ian (Microstar Laboratories) 2017-01-03 01:03:45 Re: Cluster wide option to control symbol case folding