Synchronous replay take III

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Synchronous replay take III
Date: 2018-03-01 01:39:10
Message-ID: CAEepm=0Q6kCKMYFBN+Vv2frPc=3cS3T1MPOxnZ9do8+NHzoJTA@mail.gmail.com
Lists: pgsql-hackers

Hi hackers,

I was pinged off-list by a fellow -hackers denizen interested in the
synchronous replay feature and wanting a rebased patch to test. Here
it goes, just in time for a Commitfest. Please skip to the bottom of
this message for testing notes.

In previous threads[1][2][3] I called this feature proposal "causal
reads". That was a terrible name, borrowed from MySQL. While it is
probably a useful term of art, for one thing people kept reading it as
"casual", which it ain't, and more importantly this patch is only one
way to achieve read-follows-write causal consistency. Several others
are proposed or exist in forks (user managed wait-for-LSN, global
transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a
write transaction, it waits until the data has become visible on all
elements of the array, and if an array element is not responding fast
enough it is kicked out of the array. For readers, it's a little
different because you're connected directly to the array elements
(rather than going through a central controller), so it uses a system
of leases that allows read transactions to know instantly whether
they are running on an element that is currently in the array, and can
therefore service synchronous_replay transactions, or whether they
should raise an error telling you to go and ask some other element.

This is a design choice favouring read-mostly workloads at the expense
of write transactions. Hot standbys' whole raison d'être is to
move *some* read-only workloads off the primary server. This proposal
is for users who are prepared to trade increased primary commit
latency for a guarantee about visibility on the standbys, so that
*all* read-only work could be moved to hot standbys.

The guarantee is: When two transactions tx1, tx2 are run with
synchronous_replay set to on and tx1 reports successful commit before
tx2 begins, then tx2 is guaranteed either to see tx1 or to raise a new
error 40P02 if it is run on a hot standby. I have joked that that
error means "snapshot too young". You could handle it the same way
you handle deadlocks and serialization failures: by retrying, except
in this case you might want to avoid that node for a while.
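
For illustration, here is a minimal sketch of that guarantee in SQL.
The table name is made up, and the exact error text isn't shown; the
GUC synchronous_replay and SQLSTATE 40P02 are as described above:

    -- Session 1, connected to the primary:
    SET synchronous_replay = on;
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    COMMIT;   -- returns only once the change is visible on every
              -- standby that currently holds a lease

    -- Session 2, connected to any hot standby, begun after session 1's
    -- COMMIT has returned:
    SET synchronous_replay = on;
    SELECT balance FROM accounts WHERE id = 1;
    -- either sees the new balance, or fails with SQLSTATE 40P02 because
    -- this standby doesn't currently hold a lease; handle that like a
    -- serialization failure: retry, preferably on a different node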

Note that this feature is concerned with transaction visibility. It
is not concerned with transaction durability. It will happily kick
all of your misbehaving or slow standbys out of the array so that you
fall back to single-node commit durability. You can express your
durability requirement (i.e. I must have N copies of the data on
disk before I tell any external party about a transaction) separately,
by configuring regular synchronous replication alongside this feature.
I suspect that this feature would be most popular with people who are
already using regular synchronous replication though, because they
already tolerate higher commit latency.
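
As a rough sketch of combining the two, something like the following
could be configured on the primary. The standby names are made up, and
I'm assuming the new GUC can be set with ALTER SYSTEM like any other
setting; the testing notes below just edit postgresql.conf directly:

    -- Durability: COMMIT doesn't return until at least one of the named
    -- standbys has the WAL safely on disk (existing synchronous replication).
    ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1, standby2)';

    -- Visibility: additionally wait until the commit is applied and visible
    -- on every standby holding a lease, revoking the lease of any standby
    -- that falls more than 2s behind.
    ALTER SYSTEM SET synchronous_replay_max_lag = '2s';

    SELECT pg_reload_conf();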

STATUS

Here's a quick summary of the status of this proposal as I see it:

* Simon Riggs, as the committer most concerned with the areas this
proposal touches -- namely streaming replication and specifically
syncrep -- has not so far appeared to be convinced by the value of
this approach, and has expressed a preference for pursuing client-side
or middleware tracked LSN tokens exclusively. I am perceptive enough
to see that failing to sell the idea to Simon is probably fatal to the
proposal. The main task therefore is to show convincingly that there
is a real use case for this high-level design and its set of
trade-offs, and that it justifies its maintenance burden.

* I have tried to show that there are already many users who route
their read-only queries to hot standby databases (not just "reporting
queries"), and libraries and tools to help people do that using
heuristics like "logged in users need fresh data, so primary only" or
"this session has written in the past N minutes, so primary only".
This proposal would provide a way for those users to do something
based on a guarantee instead of such flimsy heuristics. I have tried
to show that the libraries used by Python, Ruby, Java etc to achieve
that sort of load balancing should easily be able to handle finding
read-only nodes, routing read-only queries and dealing with the new
error. I do acknowledge that such libraries could also be used
to provide transparent read-my-writes support by tracking LSNs and
injecting wait-for-LSN directives, as in the alternative proposals, but
that is weaker than a global read-follows-write guarantee and the
difference can matter.

* I have argued that token-based systems are in fact rather
complicated[4] and by no means a panacea. As usual, there are a whole
bunch of trade-offs. I suspect that this proposal AND fully
user-managed causality tokens (no middleware) are both valuable sweet
spots for a non-GTM system.

* Ants Aasma pointed out that this proposal doesn't provide a
read-follows-read guarantee. He is right, and I'm not sure to what
extent that is a problem, but I also think token-based systems can
probably only solve it with fairly high costs.

* Dmitry Dolgov reported a bug causing the replication protocol to get
corrupted on some OSs but not others[5]; could be uninitialised data
or size/padding/layout thinko or other stupid problem. (Gee, it would
be nice if the wire protocol writing and reading code were in reusable
functions instead of open-coded in multiple places... the bug could
be due to that). Unfortunately I haven't managed to track it down yet
and haven't had time to get back to this in time for the Commitfest
due to other work. Given the interest a reviewer has expressed in testing
this, which might result in that problem being figured out, I figured
I might as well post the rebased patch anyway, and I will also have
another look soon.

* As Andres Freund pointed out, this currently lacks tests. It should
be fairly easy to add TAP tests to exercise this code, in the style of
the existing tests for replication.

TESTING NOTES

Set up some hot standbys, put synchronous_replay_max_lag = 2s in the
primary's postgresql.conf, then set synchronous_replay = on in every
postgresql.conf or at least in every session that you want to test
with. Then generate various write workloads and observe the primary
server's log as leases are granted and revoked, or check the status
in pg_stat_replication's replay_lag and sync_replay columns. Verify
that you can't successfully run synchronous_replay = on transactions
on standbys that don't currently have a lease, and that you can't
trick it by cutting your network cables with scissors or killing
random processes etc. You might want to verify my claims about clock
drift and the synchronous_replay_lease_time, either mathematically or
experimentally.
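
For concreteness, a couple of checks along those lines (column names as
described above; whether a trivial query is enough to trigger the error
on a lease-less standby is my assumption, but any read in a
synchronous_replay transaction should do it):

    -- On the primary: which standbys hold a lease, and how far behind
    -- are they?
    SELECT application_name, sync_replay, replay_lag
      FROM pg_stat_replication;

    -- On a standby whose lease has been revoked: this should fail with
    -- SQLSTATE 40P02 instead of returning possibly-stale results.
    SET synchronous_replay = on;
    SELECT count(*) FROM pg_class;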

Thanks for reading!

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk%3DzNXA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAEepm%3D1iiEzCVLD%3DRoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CA%2BCSw_tz0q%2BFQsqh7Zx7xxF99Jm98VaAWGdEP592e7a%2BzkD_Mw%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CAEepm%3D0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAEepm%3D352uctNiFoN84UN4gtunbeTK-PBLouVe8i_b8ZPcJQFQ%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com

Attachment: 0001-Synchronous-replay-mode-for-avoiding-stale-reads--v5.patch (application/octet-stream, 83.2 KB)
