|From:||Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>|
|To:||Dmitry Dolgov <9erthalion6(at)gmail(dot)com>|
|Cc:||Pg Hackers <pgsql-hackers(at)postgresql(dot)org>|
|Subject:||Re: Causal reads take II|
On Wed, May 24, 2017 at 3:58 PM, Thomas Munro
>> On Mon, May 22, 2017 at 4:10 AM, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
>>> I'm wondering about status of this patch and how can I try it out?
> I ran into a problem while doing this, and it may take a couple more
> days to fix it since I am at pgcon this week. More soon.
Apologies for the extended delay. Here is the rebased patch, now with
a couple of improvements (see below). To recap, this is the third
part of the original patch series, which had these components:
1. synchronous_commit = remote_apply, committed in PostgreSQL 9.6
2. replication lag tracking, committed in PostgreSQL 10
3. causal_reads, the remaining part, hereby proposed for PostgreSQL 11
The goal is to allow applications to move arbitrary read-only
transactions to physical replica databases and still know that they
can see all preceding write transactions or get an error. It's
something like regular synchronous replication with synchronous_commit
= remote_apply, except that it limits the impact on the primary and
handles failure transitions with defined semantics.
The inspiration for this kind of distributed read-follows-write
consistency using read leases was a system called Comdb2, whose
designer encouraged me to try to extend Postgres's streaming
replication to do something similar. Read leases can also be found in
some consensus systems like Google Megastore, albeit in more ambitious
form IIUC. The name is inspired by a MySQL Galera feature
(approximately the same feature but the approach is completely
different; Galera adds read latency, whereas this patch does not).
Maybe it needs a better name.
Is this a feature that people want to see in PostgreSQL?
IMPROVEMENTS IN V17
The GUC to enable the feature is now called
"causal_reads_max_replay_lag". Standbys listed in
causal_reads_standby_names whose pg_stat_replication.replay_lag
doesn't exceed that time are "available" for causal reads and will be
waited for by the primary when committing. When they exceed that
threshold they are briefly in "revoking" state and then "unavailable",
and when they return to an acceptable level they are briefly in
"joining" state before reaching "available". CR states appear in
pg_stat_replication and transitions are logged at LOG level.
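To make the state cycle above concrete, here is a rough sketch in Python. It is not the patch's code: the names CRState and next_state are hypothetical, and the sketch assumes that a standby whose lag is too high while "joining" drops back to "unavailable", and that "revoking" always proceeds to "unavailable" on the next step.

```python
from enum import Enum


class CRState(Enum):
    # The four causal reads states named in the text above.
    UNAVAILABLE = "unavailable"
    JOINING = "joining"
    AVAILABLE = "available"
    REVOKING = "revoking"


def next_state(state, replay_lag, max_replay_lag):
    """Advance a standby by one transition (hypothetical helper).

    replay_lag and max_replay_lag are in seconds; max_replay_lag stands in
    for causal_reads_max_replay_lag.
    """
    lag_ok = replay_lag <= max_replay_lag
    if state is CRState.AVAILABLE:
        # Exceeding the threshold starts revocation of the lease.
        return CRState.AVAILABLE if lag_ok else CRState.REVOKING
    if state is CRState.REVOKING:
        # Assumption: revocation always completes, by ack or lease expiry.
        return CRState.UNAVAILABLE
    if state is CRState.UNAVAILABLE:
        return CRState.JOINING if lag_ok else CRState.UNAVAILABLE
    # JOINING: assumption that a relapse sends the standby back.
    return CRState.AVAILABLE if lag_ok else CRState.UNAVAILABLE
```

With max_replay_lag = 2.0, a standby reporting 3.0s of lag would go available, revoking, unavailable, and after recovering to 1.0s it would pass through joining back to available.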
A new GUC called "causal_reads_lease_time" controls the lifetime of
read leases sent from the primary to the standby. This affects the
frequency of lease replacement messages, and more importantly affects
the worst case of commit stall that can be introduced if connectivity
to a standby is lost and we have to wait for the last sent lease to
expire. In the previous version, one single GUC controlled both
maximum tolerated replay lag and lease lifetime, which was good from
the point of view that fewer GUCs are better, but bad because it had
to be set fairly high when doing both jobs to be conservative about
clock skew. The lease lifetime must be at least 4 x maximum tolerable
clock skew. After the recent botching of a leap-second transition on
a popular public NTP network (TL;DR OpenNTP is not a good choice of
implementation to add to a public time server pool) I came to the
conclusion that I wouldn't want to recommend a default max clock skew
under 1.25s, to allow for some servers to be confused about leap
seconds for a while or to be running different smearing algorithms. A
reasonable causal_reads_lease_time recommendation for people who don't
know much about the quality of their time source might therefore be
5s. I think it's reasonable to want to set the maximum tolerable
replay lag to a lower time than that, or in fact as low as you like,
depending on your workload and hardware. Therefore I decided to split
the old "causal_reads_timeout" GUC into "causal_reads_max_replay_lag"
and "causal_reads_lease_time".
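The arithmetic relating lease lifetime to clock skew in the paragraph above can be sketched as follows. The function names are hypothetical; only the factor of 4 and the 1.25s / 5s figures come from the text.

```python
def min_lease_time(max_clock_skew_s):
    """Smallest lease lifetime (seconds) for a given maximum clock skew.

    Encodes the rule that the lease lifetime must be at least
    4 x the maximum tolerable clock skew.
    """
    return 4.0 * max_clock_skew_s


def max_tolerable_skew(lease_time_s):
    """Largest clock skew (seconds) covered by a given lease lifetime."""
    return lease_time_s / 4.0


# The conservative skew of 1.25s gives the suggested 5s lease time.
recommended_lease = min_lease_time(1.25)
```

The maximum replay lag is independent of this calculation, which is the point of having two settings: it can be set well below the lease time.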
This new version introduces fast lease revocation. Whenever the
primary decides that a standby is not keeping up, it kicks it out of
the set of CR-available standbys and revokes its lease, so that anyone
trying to run causal reads transactions there will start receiving a
new error. In the previous version, it always did that by blocking
commits while waiting for the most recently sent lease to expire,
which I now call "slow revocation" because it could take several
seconds. Now it blocks commits only until the standby acknowledges
that it is no longer available for causal reads OR the lease expires:
ideally that takes the time of a network round trip. Slow
revocation is still needed in various failure cases such as lost
Apply the patch after first applying a small bug fix for replication
lag tracking. Then:
1. Set up some streaming replicas.
2. Stick causal_reads_max_replay_lag = 2s (or any time you like) in
the primary's postgresql.conf.
3. Set causal_reads = on in some transactions on various nodes.
4. Try to break it!
As long as your system clocks don't disagree by more than 1.25s
(causal_reads_lease_time / 4), the causal reads guarantee will be
upheld: standbys will either see transactions that have completed on
the primary or raise an error to indicate that they are not available
for causal reads transactions. You should not be able to break this
guarantee, no matter what you do: unplug the network, kill arbitrary
If you mess with your system clocks so they differ by more than
causal_reads_lease_time / 4, you should see that a reasonable effort
is made to detect that so it's still very unlikely you can break it
(you'd need clocks to differ by more than causal_reads_lease_time / 4
but less than causal_reads_lease_time / 4 + network latency so that
the excessive skew is not detected, and then you'd need a very well
timed pair of transactions and loss of connectivity).
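As a rough illustration of the skew window described above, the following sketch classifies a given clock skew. It is only a restatement of the bounds in the text, not the detection logic in the patch; skew_outcome and its return strings are hypothetical.

```python
def skew_outcome(skew_s, lease_time_s, network_latency_s):
    """Classify a clock skew (seconds) against the bounds described above.

    "safe": skew is within causal_reads_lease_time / 4.
    "possibly undetected": skew is above that limit but below the limit
        plus network latency, the narrow window where detection can miss it.
    "detected": skew is large enough that it should be noticed.
    """
    limit = lease_time_s / 4.0
    skew = abs(skew_s)
    if skew <= limit:
        return "safe"
    if skew < limit + network_latency_s:
        return "possibly undetected"
    return "detected"
```

With a 5s lease time and 50ms of network latency, the risky window is only the range between 1.25s and 1.30s of skew, and even then a badly timed pair of transactions and a loss of connectivity would be needed.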