From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-03-03 06:34:08
Message-ID: CAB7nPqSbCHF-fZjMxFZ3uO22V4Vp8NqAQ37YxmTRuExDDwTjkg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>
>> Hi,
>>
>> On 2016/02/29 18:05, Thomas Munro wrote:
>>> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>>>> + servers. A transaction that is run with
>>>> <varname>causal_reads</> set
>>>> + to <literal>on</> is guaranteed either to see the effects of all
>>>> + completed transactions run on the primary with the setting on, or to
>>>> + receive an error "standby is not available for causal reads".
>>>>
>>>> "A transaction that is run" means "A transaction that is run on a
>>>> standby", right?
>>>
>>> Well, it could be any server, standby or primary. Of course standbys
>>> are the interesting case since it was already true that if you run
>>> two sequential transactions on the primary, the second can see the
>>> effect of the first, but I like the idea of a general rule that
>>> applies anywhere, allowing you not to care which server it is.
>>
>> I meant actually in context of that sentence only.
>
> Ok, here's a new version that includes that change, fixes a conflict
> with recent commit 10b48522 and removes an accidental duplicate copy
> of the README file.

Finally I got my eyes on this patch.

To put it short, this patch introduces two different concepts:
- addition of a new value, remote_apply, for synchronous_commit. This
is where things overlap a bit with the N-sync patch: combining
remote_apply with N-sync would make it possible to ensure that an LSN
is applied on multiple targets instead of just one. In this case, as
the master waits for the LSN to be applied on the sync standby, the
application needs no error handling for read transactions on the
standby: the change is already visible there by the time the commit
returns on the master. The cost is that a lagging standby adds extra
wait on the master side.
- causal reads, which makes the master able to drop standbys that are
lagging too far behind, and lets callers on such standbys know that
the standby is lagging too far behind to satisfy causal reads. In
this case the application does need error handling for standbys that
have fallen behind (see the libpq sketch below).
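
To make the difference concrete, here is a rough client-side sketch
(illustrative only, not code from the patch) of the error handling an
application would need in causal reads mode when reading from a
standby; the error message text is the one quoted in the
documentation above:

    #include <string.h>
    #include <libpq-fe.h>

    /*
     * Run a read-only query on a standby with causal_reads enabled,
     * falling back to the primary if the standby reports that it
     * cannot satisfy causal reads.
     */
    static PGresult *
    causal_read(PGconn *standby, PGconn *primary, const char *query)
    {
        PGresult   *res;

        PQclear(PQexec(standby, "SET causal_reads = on"));
        res = PQexec(standby, query);
        if (PQresultStatus(res) != PGRES_TUPLES_OK &&
            strstr(PQerrorMessage(standby),
                   "standby is not available for causal reads") != NULL)
        {
            /* Standby lagging too far behind: retry on the primary. */
            PQclear(res);
            res = PQexec(primary, query);
        }
        return res;
    }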

That's one of my concerns about this patch now: it is trying to do too
much. I think that there is definitely a need for both things:
applications may be fine paying the lag price with remote_apply in
exchange for not needing an extra error handling layer, or they cannot
accept a potentially large lag and are ready to add an extra error
handling layer to manage read failures on standbys. So I would
recommend a split to begin with:
1) Addition of remote_apply in synchronous_commit, which would be
quite a simple patch, and I am convinced that there are benefits in
having that. Compared to the previous patch sent, a standby message is
sent back to the master once COMMIT has been applied, accelerating
things a bit.
2) Patch for causal reads, with all its extra parametrization logic
and stuff to select standby candidates.
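
On top of 1), an application wanting apply-level consistency without
any read-side error handling would simply do something like this
(illustrative; the table name is made up):

    /* With synchronous_commit = remote_apply, the commit does not
     * return until the sync standby has applied the change, so a
     * subsequent read on that standby sees it. */
    PQclear(PQexec(primary, "SET synchronous_commit = remote_apply"));
    PQclear(PQexec(primary, "INSERT INTO t VALUES (42)"));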

Here is also a more detailed review...

Regression tests are failing: rules.sql generates diffs because
pg_stat_replication has changed.

CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also
call XLogRequestWalReceiverReply() when needed.
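
To be concrete, I would expect call sites roughly like the following
(the placement is my guess from reading twophase.c, assuming the
patch's single-LSN signature for CausalReadsWaitForLSN; this is not
what the patch does today):

    /* In the PREPARE, COMMIT PREPARED and ROLLBACK PREPARED record
     * paths, once the record's LSN is known, mirror what
     * RecordTransactionCommit() does: */
    CausalReadsWaitForLSN(recptr);

    /* And in the redo routines for those records, ask walreceiver
     * to report its positions without delay: */
    XLogRequestWalReceiverReply();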

The new parameters are missing from postgresql.conf.sample.

+static bool
+SyncRepCheckEarlyExit(void)
+{
Doing this refactoring would actually make sense as a separate patch I
think, and that would simplify the core patch for causal reads.
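
For reference, the logic that would move into it is roughly the
following (paraphrasing the early-exit checks already inline in
SyncRepWaitForLSN; the real code also emits a WARNING at each exit):

    static bool
    SyncRepCheckEarlyExit(void)
    {
        /* Backend is being killed: the transaction has already
         * committed locally, it just may not have reached the
         * standby. */
        if (ProcDiePending)
        {
            whereToSendOutput = DestNone;
            SyncRepCancelWait();
            return true;
        }

        /* A query cancel aborts the wait, not the local commit. */
        if (QueryCancelPending)
        {
            QueryCancelPending = false;
            SyncRepCancelWait();
            return true;
        }

        /* Postmaster death leaves us in an unsafe state. */
        if (!PostmasterIsAlive())
        {
            ProcDiePending = true;
            whereToSendOutput = DestNone;
            SyncRepCancelWait();
            return true;
        }

        return false;
    }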

+For this reason we say that the causal reads guarantee only holds as
+long as the absolute difference between the system clocks of the
+machines is no more than max_clock_skew. The theory is that NTP makes
+it possible to reason about the maximum possible clock difference
+between machines and choose a value that allows for a much larger
+difference. However, we do make a best effort attempt to detect
+misconfigured systems as described above, to catch the case of servers
+not running a correctly configured ntp daemon, or with a clock so
+far out of whack that ntp refuses to fix it.
Just wondering how this reacts when standby and master run in
different timezones or with clocks that drift apart. I know of two
ways to measure WAL lag: one is what you are doing, using a timestamp
and relying on the two servers being in sync at clock level. The
second is a WAL quantity in bytes, say a max_wal_lag parameter or
similar expressed as the number of WAL segments the standby may lag
behind; it is less user-friendly to set up, but the concerns
regarding clock sync across nodes go away. To put it short, I find
the timestamp approach easy to set up and understand for the user,
but not really solid, as it depends on shared clock state between
different hosts, while the difference between the flush and apply LSN
positions is a concept independent of any clock. So the base of
comparison would not be the timestamp of the transaction commit but
the flush LSN at the moment the commit is replayed.
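
A sketch of what that byte-based check could look like on the
walsender side (max_wal_lag_bytes is a made-up GUC, purely for
illustration):

    static bool
    StandbyWithinLagBudget(WalSnd *walsnd)
    {
        XLogRecPtr  flush;
        XLogRecPtr  apply;

        SpinLockAcquire(&walsnd->mutex);
        flush = walsnd->flush;
        apply = walsnd->apply;
        SpinLockRelease(&walsnd->mutex);

        /* Lag in bytes of WAL, independent of any system clock. */
        return (flush - apply) <= max_wal_lag_bytes;
    }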

+ /*
+ * If a causal_reads_timeout is configured, it is used instead of
+ * wal_sender_timeout. Ideally we'd use causal_reads_timeout / 2 +
+ * allowance for network latency, but since walreceiver can become quite
+ * bogged down fsyncing WAL we allow more tolerance. (This could be
+ * tightened up once standbys hand writing off to the WAL writer).
+ */
+ if (causal_reads_timeout != 0)
+ allowed_time = causal_reads_timeout;
+ else
+ allowed_time = wal_sender_timeout;
I find that surprising, for two reasons:
1) it seems to me that causal_reads_timeout has conceptually no
relation with WAL sender keepalive control.
2) A standby should still be able to receive WAL even if it cannot
satisfy causal reads, to give it a chance to catch up more quickly on
the amount by which it is behind.
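
In code terms, I would rather see something like this separation
(both helper names are hypothetical):

    /* Liveness remains wal_sender_timeout's job, unconditionally. */
    allowed_time = wal_sender_timeout;

    /* Causal reads availability is tracked separately, so a lagging
     * standby keeps streaming and gets a chance to catch up. */
    if (StandbyApplyLagExceeds(causal_reads_timeout))
        CausalReadsMarkUnavailable(walsnd);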

- SYNCHRONOUS_COMMIT_REMOTE_FLUSH /* wait for local and remote flush */
+ SYNCHRONOUS_COMMIT_REMOTE_FLUSH, /* wait for local and remote flush */
+ SYNCHRONOUS_COMMIT_REMOTE_APPLY, /* wait for local flush and remote
+ * apply */
+ SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
+ apply with causal consistency */
SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a
typo s/flusha/flush a/.

I am still digging into the patch, the available/joining/unavailable
logic being quite interesting.
--
Michael
