Re: Proposal: "Causal reads" mode for load balancing reads without stale data

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-03-09 05:07:03
Message-ID: CAEepm=0eMpTy1a8b=FVtGZBFxnQkFoO1dU0ib=UrbKMwkUMKqg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Finally I got my eyes on this patch.

Thanks, I really appreciate your time and interest. Your feedback
gave me plenty to chew on. See below for replies to both your recent
emails.

> To put it short, this patch introduces two different concepts:
> - addition of a new value, remote_apply, for synchronous_commit, which
> is actually where things overlap a bit with the N-sync patch, because
> by combining the addition of remote_apply with the N-sync patch, it is
> possible to ensure that an LSN is applied to multiple targets instead
> of one now. In this case, as the master will wait for the LSN to be
> applied on the sync standby, there is no need for the application to
> have error handling in case a read transaction is happening on the
> standby as the change is already visible on the standby when
> committing it on master. However the cost here is that if the standby
> node lags behind, it puts some extra wait as well on the master side.
> - causal reads, which makes the master able to drop standbys that are
> lagging too much behind and lets callers of standbys know that a
> standby is lagging too much behind and cannot satisfy causal reads. In
> this case error handling is needed by the application in case a
> standby is lagging too much behind.
>
> That's one of my concerns about this patch now: it is trying to do too
> much. I think that there is definitely a need for both things:
> applications may be fine paying the lag price when remote_apply is
> used and not having an extra error handling layer, or they cannot
> accept a potentially large lag and are ready to have an extra error
> handling layer to manage read failures on standbys. So I would
> recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be
> quite a simple patch, and I am convinced that there are benefits in
> having that. Compared to the previous patch sent, a standby message is
> sent back to the master once COMMIT has been applied, accelerating
> things a bit.
> 2) Patch for causal reads, with all its extra parametrization logic
> and stuff to select standby candidates.

Agreed. I have split this work up into four patches which apply on
top of each other, and provide something (hopefully) useful at each
stage. Namely:

0001-remote-apply.patch: This adds synchronous_commit = remote_apply.
It works by setting a new bit in commit records to say that the
primary requests feedback when the record is applied. When the
recovery process sees this bit, it wakes the walreceiver process and
asks it to send a reply to the primary ASAP.
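
To make the mechanism concrete, the replay side looks roughly like this
(a simplified sketch using the names discussed in this thread, not the
patch text verbatim):

    /*
     * Sketch: while redoing a commit record, notice the new flag and ask
     * walreceiver to report the apply position right away instead of
     * waiting for the next status interval.
     */
    static void
    xact_redo_commit_sketch(xl_xact_parsed_commit *parsed, XLogRecPtr lsn)
    {
        /* ... the usual redo work for the commit record ... */

        if (parsed->xinfo & XACT_COMPLETION_SYNC_APPLY_FEEDBACK)
            XLogRequestWalReceiverReply();
    }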

0002-replay-lag.patch: This adds a time-based estimate of the current
apply lag on each standby. The value is visible in
pg_stat_replication. It uses a combination of commit timestamps and
periodic LSN->time samples from keepalive messages, so that replay lag
is computed even when there are no commits, for example during bulk
data loads. This builds on the previous patch, because it makes use
of the mechanism whereby the recovery process can ask the walreceiver
to send replies whenever it applies WAL records for which it has
timestamps.

0003-refactor-syncrep-exit.patch: This moves the code to handle
syncrep wait cancellation into a function, as you suggested.
Conceptually independent of the previous two patches, but presented as
a patch on top of them and separately from causal-reads.patch for ease
of review.

0004-causal-reads.patch: This adds the causal reads feature, building
on top of the other three patches. From the remote-apply patch it
uses the mechanism for asking for apply feedback. From the replay-lag
patch it uses the mechanism for tracking apply lag. It also uses the
syncrep wait cancellation code from the refactoring patch.

> Here is as well a more detailed review...
>
> Regression tests are failing, rules.sql is generating diffs because
> pg_stat_replication is changed.

Fixed, thanks.

> CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
> COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also
> call XLogRequestWalReceiverReply() when needed.

Ah yes, agreed for COMMIT PREPARED. Fixed.

But why for PREPARE TRANSACTION or ROLLBACK PREPARED? Those don't
generate visible effects that others might expect to see on a standby,
so I don't think that is necessary.
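
For COMMIT PREPARED the wait conceptually belongs in the same place as
for a plain COMMIT, something like this (hypothetical sketch, not the
patch's actual two-phase code; I'm assuming CausalReadsWaitForLSN()
takes the commit record's LSN):

    static void
    finish_commit_prepared_sketch(XLogRecPtr commit_lsn)
    {
        /* The COMMIT PREPARED record has been written; flush it locally. */
        XLogFlush(commit_lsn);

        /*
         * Wait until causal reads standbys have applied it (or have lost
         * their leases), just like the plain commit path.  PREPARE and
         * ROLLBACK PREPARED skip this, since they don't make new data
         * visible to readers on a standby.
         */
        CausalReadsWaitForLSN(commit_lsn);
    }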

> The new parameters are missing from postgresql.conf.sample.

Fixed, thanks.

> +static bool
> +SyncRepCheckEarlyExit(void)
> +{
> Doing this refactoring would actually make sense as a separate patch I
> think, and that would simplify the core patch for causal reads.

Agreed -- see 0003-refactor-syncrep-exit.patch.
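
For reference, the helper ends up with roughly this shape (simplified
here; the real function in 0003 also emits the appropriate WARNINGs
when it gives up waiting):

    static bool
    SyncRepCheckEarlyExit(void)
    {
        /*
         * Stop waiting on process death or query cancel: the transaction
         * is already committed locally, we just give up on the standby.
         */
        if (ProcDiePending || QueryCancelPending)
        {
            SyncRepCancelWait();
            return true;
        }

        /* If the postmaster dies, no walsender is left to wake us up. */
        if (!PostmasterIsAlive())
        {
            SyncRepCancelWait();
            return true;
        }

        return false;
    }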

> +For this reason we say that the causal reads guarantee only holds as
> +long as the absolute difference between the system clocks of the
> +machines is no more than max_clock_skew. The theory is that NTP makes
> +it possible to reason about the maximum possible clock difference
> +between machines and choose a value that allows for a much larger
> +difference. However, we do make a best effort attempt to detect
> +misconfigured systems as described above, to catch the case of servers
> +not running a correctly configured ntp daemon, or with a clock so
> +far out of whack that ntp refuses to fix it.
>
> Just wondering how this reacts when standby and master are on
> different timezones. I know of two ways to measure WAL lag: one is
> what you are doing, by using a timestamp and rely on the two servers
> to be in sync at clock level. The second is in bytes with a WAL
> quantity, though it is less user-friendly to set up, say max_wal_lag
> or similar, symbolized by the number of WAL segments the standby is
> lagging behind, the concerns regarding clock sync across nodes go
> away. To put it short, I find the timestamp approach easy to set up
> and understand for the user, but not really solid as it depends much
> on the state dependency between different hosts, while a difference
> between flush and apply LSN positions is a quite independent concept.
> So basically what would be used as a base comparison is not the
> timestamp of the transaction commit but the flush LSN at the moment
> commit has been replayed.

I'm not actually using either of those approaches. Postgres already
periodically captures (time, WAL end) pairs on the primary, and feeds
them to the standby. With this patch set, the standby records these
samples in a circular buffer, and when the recovery process eventually
applies sampled WAL positions, it feeds the associated timestamps back
to the primary in unsolicited keepalive reply messages. Then the
primary can judge how long it took the standby to apply the sampled
WAL position. It doesn't suffer from clock skew problems because it's
comparing two timestamps obtained from the system clock on the same
box. On the other hand, the time reported includes an extra network
latency term as the reply message has to reach the primary. That
might be a good thing, depending on your purpose, as it reflects the
time you'd be waiting for syncrep/causal reads on that standby, which
includes such latency.

I had to implement this because, in an early prototype of the causal
reads feature, the system would choke on large data load transactions
even though the standbys were able to keep up. Lag could only be
measured when commit records came along. That meant that a very big
data load was indistinguishable from a system not applying fast
enough, and the standby would drop to causal reads unavailable state
spuriously. I considered interpolating WAL positions in between
commit records, but couldn't see how to make it work. I considered
injecting extra timestamps into the WAL periodically, but I didn't
need anything persistent. I just needed a way to collect occasional
(time, LSN) samples in memory and feed the times back at the
appropriate moments as the WAL is replayed.
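
In code terms the standby side is conceptually just a small ring buffer
(minimal sketch; the real patch shares this between walreceiver and the
startup process and is more careful about locking and overflow):

    #define LAG_SAMPLE_SLOTS 1024

    typedef struct LagSample
    {
        TimestampTz sent_time;  /* primary's clock when this LSN was current */
        XLogRecPtr  lsn;        /* WAL position the timestamp refers to */
    } LagSample;

    static LagSample samples[LAG_SAMPLE_SLOTS];
    static int  sample_head = 0;    /* next slot to write */
    static int  sample_tail = 0;    /* oldest unconsumed slot */

    /* Walreceiver: remember a (time, lsn) pair as it arrives. */
    static void
    record_lag_sample(TimestampTz sent_time, XLogRecPtr lsn)
    {
        samples[sample_head].sent_time = sent_time;
        samples[sample_head].lsn = lsn;
        sample_head = (sample_head + 1) % LAG_SAMPLE_SLOTS;
        if (sample_head == sample_tail)         /* full: drop the oldest */
            sample_tail = (sample_tail + 1) % LAG_SAMPLE_SLOTS;
    }

    /* Recovery: once replay passes a sampled LSN, return its timestamp so
     * it can be fed back to the primary (0 means nothing new to report). */
    static TimestampTz
    consume_lag_samples(XLogRecPtr replayed_upto)
    {
        TimestampTz result = 0;

        while (sample_tail != sample_head &&
               samples[sample_tail].lsn <= replayed_upto)
        {
            result = samples[sample_tail].sent_time;
            sample_tail = (sample_tail + 1) % LAG_SAMPLE_SLOTS;
        }
        return result;
    }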

> + /*
> + * If a causal_reads_timeout is configured, it is used instead of
> + * wal_sender_timeout. Ideally we'd use causal_reads_timeout / 2 +
> + * allowance for network latency, but since walreceiver can become quite
> + * bogged down fsyncing WAL we allow more tolerance. (This could be
> + * tightened up once standbys hand writing off to the WAL writer).
> + */
> + if (causal_reads_timeout != 0)
> + allowed_time = causal_reads_timeout;
> + else
> + allowed_time = wal_sender_timeout;
> I find that surprising, for two reasons:
> 1) it seems to me that causal_read_timeout has in concept no relation
> with WAL sender process control.
> 2) A standby should still be able to receive WAL even if it cannot
> satisfy causal reads to give it a chance to catch up faster the amount
> it is late.

This is only used to detect dead/disconnected/unreachable standbys,
not to handle standbys that are merely not keeping up with replication
traffic. To get disconnected this way, a standby has to be not
responding to keepalive pings. When dealing with a causal reads
standby, we want to cap the time we're prepared to hang around waiting
for a zombie standby before nuking it. If we didn't have that change,
then we would stop waiting for slow standbys (not applying fast
enough) in a timely fashion, but crashed/disconnected standbys would
make our commits hang for the potentially much longer
wal_sender_timeout, which seems silly. Zombie servers shouldn't be
able to interfere with your commits for longer periods of time than
lagging servers.

> - SYNCHRONOUS_COMMIT_REMOTE_FLUSH /* wait for local and remote flush */
> + SYNCHRONOUS_COMMIT_REMOTE_FLUSH, /* wait for local and remote flush */
> + SYNCHRONOUS_COMMIT_REMOTE_APPLY, /* wait for local flush and remote
> + * apply */
> + SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
> + apply with causal consistency */
> SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a
> typo s/flusha/flush a/.
>
> I am still digging into the patch, the available/joining/unavailable
> logic being quite interesting.

I hope the new patch set makes that job easier...

On Thu, Mar 3, 2016 at 8:00 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Hm. Looking now at
> http://www.postgresql.org/message-id/CANP8+j+jCpNoOjc-KQLtt4PDyOX2Sq6wYWqCSy6aaHWkvNa0hw@mail.gmail.com
> it would be nice to get a clear solution for it first, though the use
> of signals to wake up the WAL receiver and force it to send a new
> LSN apply position back to the master to unlock it asap does not look
> very appealing. Seeing that no patch has been sent for 9.6 regarding
> that, it would be better to simply drop this code from the causal-read
> patch perhaps...

The signalling mechanism is still present. Since that discussion I
had the idea of using a different (non-overloaded) signal and
unblocking it only while actually waiting for the network, so there
should be no signal delivery interrupting blocking disk IO operations.
But I certainly agree that the disk IO should be moved out of the
walreceiver process and into the WAL writer process, for several
reasons. This patch series doesn't need that to work though.
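
Concretely, the "only while waiting" part just means widening the
signal mask around the network wait (a sketch under that assumption,
not the patch's exact code; SIGUSR2 stays blocked everywhere else,
including during disk IO):

    #include <signal.h>

    static void
    wait_for_socket_with_wakeups(void)
    {
        sigset_t    usr2;
        sigset_t    saved_mask;

        sigemptyset(&usr2);
        sigaddset(&usr2, SIGUSR2);

        /* Allow the wakeup signal only for the duration of the wait. */
        sigprocmask(SIG_UNBLOCK, &usr2, &saved_mask);

        /* ... wait for the socket to become readable; EINTR here means
         * "wake up and send a reply now" ... */

        /* Put the old mask back before doing any more disk IO. */
        sigprocmask(SIG_SETMASK, &saved_mask, NULL);
    }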

Here's a recap of how the signal is used, in the context of the
communication introduced by each patch:

0001-remote-apply.patch
* the primary sets a new bit XACT_COMPLETION_SYNC_APPLY_FEEDBACK in
commit records' xinfo when synchronous_commit = remote_apply
* the WAL record is transported walsender->walreceiver as usual
* the recovery process eventually applies the commit record, and, when
it sees that bit, it wakes the walreceiver with SIGUSR2
* when walreceiver receives the signal it generates a keepalive reply
to report the latest apply position
* the walsender uses that information to release syncrep waiters (just
as it always did)
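
On the walreceiver side that amounts to little more than this
(simplified sketch; XLogWalRcvSendReply() is the existing reply
routine, everything else is illustrative):

    static volatile sig_atomic_t reply_requested = false;

    /* SIGUSR2 handler: just set a flag, do the work in the main loop. */
    static void
    WalRcvSigUsr2Handler(SIGNAL_ARGS)
    {
        reply_requested = true;
    }

    /* Called from the walreceiver main loop. */
    static void
    maybe_send_requested_reply(void)
    {
        if (reply_requested)
        {
            reply_requested = false;
            XLogWalRcvSendReply(true, false);   /* force an immediate reply */
        }
    }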

0002-replay-lag.patch
* the walreceiver records (time, lsn) pairs into a circular buffer as
they arrive
* the recovery process, when it discovers that it has replayed an LSN
associated with a time in the circular buffer, uses the same signal as
used in 0001-remote-apply.patch to ask walreceiver to report this
progress
* the walreceiver includes the last replay timestamp in a new field
added to the reply message
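
The primary then only ever compares two timestamps taken from its own
clock, so the standby's clock doesn't enter into it; in effect (sketch,
assuming integer timestamps so the difference is in microseconds):

    /* Lag as seen by the primary.  Both values originate from the
     * primary's clock, so standby clock skew cancels out; the result
     * does include the network hop taken by the reply. */
    static int64
    apply_lag_microseconds(TimestampTz reported_apply_time)
    {
        return GetCurrentTimestamp() - reported_apply_time;
    }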

0004-causal-reads.patch
* reuses the above to implement causal_reads mode:
XACT_COMPLETION_SYNC_APPLY_FEEDBACK is now also set if causal_reads is
on, because it too needs to know when commit records are applied
* the rest of the patch implements the new state machine, lease
control and waiting machinery

Some alternatives to that signal could include: a pipe between
recovery and walreceiver that could be multiplexed with network IO
(which may be tricky to implement cross-platform, I don't know, but I
don't see anything similar in Postgres), or a Postgres latch that
could be multiplexed with network IO using some future primitive as
suggested elsewhere (but would ultimately still involve a signal).
There would be plenty of scope for evolution, for example when
integrating with the standby WAL writer work, but the signal approach
seemed simple and effective for now and certainly isn't the first case
of backends signalling each other.

Thanks,

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
0001-remote-apply.patch application/octet-stream 16.5 KB
0002-replay-lag.patch application/octet-stream 25.6 KB
0003-refactor-syncrep-exit.patch application/octet-stream 4.6 KB
0004-causal-reads.patch application/octet-stream 73.8 KB
