Proposal: "Causal reads" mode for load balancing reads without stale data

From: Thomas Munro <thomas.munro@enterprisedb.com>
To: Pg Hackers <pgsql-hackers@postgresql.org>
Subject: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2015-11-11 05:37:43
Message-ID: CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=zNXA@mail.gmail.com

Hi hackers,

Many sites use hot standby servers to spread read-heavy workloads over more
hardware, or at least would like to. This works well today if your
application can tolerate some time lag on standbys. The problem is that
there is no guarantee of when a particular commit will become visible for
clients connected to standbys. The existing synchronous commit feature is
no help here because it guarantees only that the WAL has been flushed on
another server before commit returns. It says nothing about whether it has
been applied or whether it has been applied on the standby that you happen
to be talking to.

A while ago I posted a small patch[1] to allow synchronous_commit to wait
for remote apply on the current synchronous standby, but (as Simon Riggs
rightly pointed out in that thread) that part isn't the main problem. It
seems to me that the main problem for a practical 'writer waits' system is
how to support a dynamic set of servers, gracefully tolerating failures and
timing out laggards, while also providing a strong guarantee during any
such transitions. Now I would like to propose something to do that, and
share a proof-of-concept patch.

=== PROPOSAL ===

The working name for the proposed feature is "causal reads", because it
provides a form of "causal consistency"[2] (and "read-your-writes"
consistency) no matter which server the client is connected to. There is a
similar feature by the same name in another product (albeit implemented
differently -- 'reader waits'; more about that later). I'm not wedded to
the name.

The feature allows arbitrary read-only transactions to be run on any hot
standby, with a specific guarantee about the visibility of preceding
transactions. The guarantee is that if you set a new GUC "causal_reads =
on" in any pair of consecutive transactions (tx1, tx2) where tx2 begins
after tx1 successfully returns, then tx2 will either see tx1 or fail with a
new error "standby is not available for causal reads", no matter which
server it runs on. A discovery mechanism is also provided, giving an
instantaneous snapshot of the set of standbys that are currently available
for causal reads (i.e. won't raise the error), in the form of a new column in
pg_stat_replication.
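
For the discovery part, a load balancer or monitoring tool could run
something like this on the primary (the causal_reads_status column name is
from the attached patch; treat the exact status values as illustrative):

  SELECT application_name, causal_reads_status
    FROM pg_stat_replication;

Standbys reported as available are the ones currently guaranteed not to
raise the error.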

For example, a web server might run tx1 to insert a new row representing a
message in a discussion forum on the primary server, and then send the user
to another web page that runs tx2 to load all messages in the forum on an
arbitrary hot standby server. If causal_reads = on in both tx1 and tx2
(for example, because it's on globally), then tx2 is guaranteed to see the
new post, or get a (hopefully rare) error telling the client to retry on
another server.
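
In SQL terms the pattern would look something like this (the messages table
and its columns are invented for the example):

  -- tx1, on the primary:
  SET causal_reads = on;
  BEGIN;
  INSERT INTO messages (forum_id, body) VALUES (42, 'Hello, world');
  COMMIT;  -- returns only after all 'available' standbys have applied it
           -- (or ceased to be 'available'; see the approach below)

  -- tx2, later, on any hot standby:
  SET causal_reads = on;
  SELECT body FROM messages WHERE forum_id = 42;
  -- sees tx1's row, or fails with the causal reads error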

Very briefly, the approach is:
1. The primary tracks apply lag on each standby (including between
commits).
2. The primary deems standbys 'available' for causal reads if they are
applying WAL and replying to keepalives fast enough, and periodically sends
the standby an authorization to consider itself available for causal reads
until a time in the near future.
3. Commit on the primary with "causal_reads = on" waits for all
'available' standbys either to apply the commit record, or to cease to be
'available' and begin raising the error if they are still alive (because
their authorizations have expired).
4. Standbys can start causal reads transactions only while they have an
authorization with an expiry time in the future; otherwise they raise an
error when an initial snapshot is taken.
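
To illustrate step 4 from a client's perspective: on a standby whose
authorization has expired, a causal reads transaction fails up front (error
text as above; the query itself is just an example):

  SET causal_reads = on;
  SELECT * FROM messages;
  ERROR:  standby is not available for causal reads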

In a follow-up email I can write about the design trade-offs considered
(mainly 'writer waits' vs 'reader waits'), comparison with some other
products, method of estimating replay lag, wait and timeout logic and how
it maintains the guarantee in various failure scenarios, logic for standbys
joining and leaving, implications of system clock skew between servers, or
any other questions you may have, depending on feedback/interest (but see
comments in the attached patch for some of those subjects). For now I
didn't want to clog up the intertubes with too large a wall of text.

=== PROOF-OF-CONCEPT ===

Please see the POC patch attached. It adds two new GUCs. After setting up
one or more hot standbys as per usual, simply add "causal_reads_timeout =
4s" to the primary's postgresql.conf and restart. Now, you can set
"causal_reads = on" in some/all sessions to get guaranteed causal
consistency. Expected behaviour: the causal reads guarantee is maintained
at all times, even while you overwhelm, kill, crash, disconnect, restart,
pause, add and remove standbys, and the primary drops lagging or unreachable
standbys from the set it waits for in a timely fashion. You can monitor the
system with the replay_lag and causal_reads_status columns in
pg_stat_replication and some state transition LOG messages on the primary.
(The patch also supports
"synchronous_commit = apply", but it's not clear how useful that is in
practice, as already discussed.)
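
Concretely, the steps are (settings and column names as described above):

  # primary's postgresql.conf, then restart:
  causal_reads_timeout = 4s

  -- in any session that wants the guarantee:
  SET causal_reads = on;

  -- monitoring, on the primary:
  SELECT application_name, replay_lag, causal_reads_status
    FROM pg_stat_replication;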

Lastly, a few notes about how this feature relates to some other work:

The current version of this patch has causal_reads as a feature separate
from synchronous_commit, from a user's point of view. The thinking behind
this is that load balancing and data loss avoidance are separate concerns:
synchronous_commit deals with the latter, and causal_reads with the
former. That said, existing SyncRep machinery is obviously used
(specifically SyncRep queues, with a small modification, as a way to wait
for apply messages to arrive from standbys). (An earlier prototype had
causal reads as a new level for synchronous_commit and associated states as
new walsender states above 'streaming'. When contemplating how to combine
this proposal with the multiple-synchronous-standby patch, some colleagues
and I came around to the view that the concerns are separate. The reason
for wanting to configure complicated quorum definitions is to control data
loss risks and has nothing to do with load balancing requirements, so we
thought the features should probably be separate.)

As a result of that separation, the multiple-synchronous-servers patch[3]
could be applied (or not) independently of this feature, since causal reads
doesn't use synchronous_standby_names or indeed any kind of statically
defined quorum.

The standby WAL writer patch[4] would significantly improve walreceiver
performance and smoothness, which would work very well with this proposal.

Please let me know what you think!

Thanks,

[1]
http://www.postgresql.org/message-id/flat/CAEepm=1fqkivL4V-OTPHwSgw4aF9HcoGiMrCW-yBtjipX9gsag@mail.gmail.com

[2] From http://queue.acm.org/detail.cfm?id=1466448

"Causal consistency. If process A has communicated to process B that it has
updated a data item, a subsequent access by process B will return the
updated value, and a write is guaranteed to supersede the earlier write.
Access by process C that has no causal relationship to process A is subject
to the normal eventual consistency rules.

Read-your-writes consistency. This is an important model where process A,
after it has updated a data item, always accesses the updated value and
will never see an older value. This is a special case of the causal
consistency model."

[3]
http://www.postgresql.org/message-id/flat/CAOG9ApHYCPmTypAAwfD3_V7sVOkbnECFivmRc1AxhB40ZBSwNQ@mail.gmail.com

[4]
http://www.postgresql.org/message-id/flat/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4ASQ@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment: causal-reads-poc.patch (application/octet-stream, 67.5 KB)
