Re: Proposal: "Causal reads" mode for load balancing reads without stale data

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-03-06 23:12:03
Message-ID: CAEepm=3D_qNup8xFKNV_mdbX_z99S8pzWLZVXkN2hmzqx_3dkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 3, 2016 at 7:34 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Mar 1, 2016 at 11:53 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Tue, Mar 1, 2016 at 2:46 PM, Amit Langote
>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>
>>> Hi,
>>>
>>> On 2016/02/29 18:05, Thomas Munro wrote:
>>>> On Mon, Feb 29, 2016 at 9:05 PM, Amit Langote wrote:
>>>>> + servers. A transaction that is run with
>>>>> <varname>causal_reads</> set
>>>>> + to <literal>on</> is guaranteed either to see the effects of all
>>>>> + completed transactions run on the primary with the setting on, or to
>>>>> + receive an error "standby is not available for causal reads".
>>>>>
>>>>> "A transaction that is run" means "A transaction that is run on a
>>>>> standby", right?
>>>>
>>>> Well, it could be any server, standby or primary. Of course standbys
>>>> are the interesting case since it it was already true that if you run
>>>> two sequential transactions run on the primary, the second can see the
>>>> effect of the first, but I like the idea of a general rule that
>>>> applies anywhere, allowing you not to care which server it is.
>>>
>>> I meant actually in context of that sentence only.
>>
>> Ok, here's a new version that includes that change, fixes a conflict
>> with recent commit 10b48522 and removes an accidental duplicate copy
>> of the README file.
>
> Finally I got my eyes on this patch.
>
> To put it short, this patch introduces two different concepts:
> - addition of a new value, remote_apply, for synchronous_commit, which
> is actually where things overlap a bit with the N-sync patch, because
> by combining the addition of remote_apply with the N-sync patch, it is
> possible to ensure that an LSN is applied to multiple targets instead
> of one now. In this case, as the master will wait for the LSN to be
> applied on the sync standby, there is no need for the application to
> have error handling in case a read transaction is happening on the
> standby as the change is already visible on the standby when
> committing it on master. However the cost here is that if the standby
> node lags behind, it puts some extra wait as well on the master side.
> - casual reads, which makes the master able to drop standbys that are
> lagging too much behind and let callers of standbys know that it is
> lagging to much behind and cannot satisfy causal reads. In this case
> error handling is needed by the application in case a standby is
> lagging to much behind.
>
> That's one of my concerns about this patch now: it is trying to do too
> much. I think that there is definitely a need for both things:
> applications may be fine to pay the lagging price when remote_apply is
> used by not having an extra error handling layer, or they cannot
> accept a perhaps large of lag and are ready to have an extra error
> handling layer to manage read failures on standbys. So I would
> recommend a split to begin with:
> 1) Addition of remote_apply in synchronous_commit, which would be
> quite a simple patch, and I am convinced that there are benefits in
> having that. Compared to the previous patch sent, a standby message is
> sent back to the master once COMMIT has been applied, accelerating
> things a bit.
> 2) Patch for causal reads, with all its extra parametrization logic
> and stuff to select standby candidates.

Thanks. Yes, this makes a lot of sense. I have done some work on
splitting this up and will post the result soon, along with my
responses to your other feedback.

> Here is as well a more detailed review...
>
> Regression tests are failing, rules.sql is generating diffs because
> pg_stat_replication is changed.
>
> CausalReadsWaitForLSN() should be called for 2PC I think, for PREPARE,
> COMMIT PREPARED and ROLLBACK PREPARED. WAL replay for 2PC should also
> call XLogRequestWalReceiverReply() when needed.
>
> The new parameters are missing from postgresql.conf.sample.
>
> +static bool
> +SyncRepCheckEarlyExit(void)
> +{
> Doing this refactoring would actually make sense as a separate patch I
> think, and that would simplify the core patch for causal reads.
>
> +For this reason we say that the causal reads guarantee only holds as
> +long as the absolute difference between the system clocks of the
> +machines is no more than max_clock_skew. The theory is that NTP makes
> +it possible to reason about the maximum possible clock difference
> +between machines and choose a value that allows for a much larger
> +difference. However, we do make a best effort attempt to detect
> +misconfigured systems as described above, to catch the case of servers
> +not running ntp a correctly configured ntp daemon, or with a clock so
> +far out of whack that ntp refuses to fix it.
> Just wondering how this reacts when standby and master are on
> different timezones. I know of two ways to measure WAL lag: one is
> what you are doing, by using a timestamp and rely on the two servers
> to be in sync at clock level. The second is in bytes with a WAL
> quantity, though it is less user-friendly to set up, say max_wal_lag
> or similar, symbolized by the number of WAL segments the standby is
> lagging behind, the concerns regarding clock sync across nodes go
> away. To put it short, I find the timestamp approach easy to set up
> and understand for the user, but not really solid as it depends much
> on the state dependency between different hosts, while a difference
> between flush and apply LSN positions is a quite independent concept.
> So basically what would be used as a base comparison is not the
> timestamp of the transaction commit but the flush LSN at the moment
> commit has been replayed.
>
> + /*
> + * If a causal_reads_timeout is configured, it is used instead of
> + * wal_sender_timeout. Ideally we'd use causal_reads_timeout / 2 +
> + * allowance for network latency, but since walreceiver can become quite
> + * bogged down fsyncing WAL we allow more tolerance. (This could be
> + * tightened up once standbys hand writing off to the WAL writer).
> + */
> + if (causal_reads_timeout != 0)
> + allowed_time = causal_reads_timeout;
> + else
> + allowed_time = wal_sender_timeout;
> I find that surprising, for two reasons:
> 1) it seems to me that causal_read_timeout has in concept no relation
> with WAL sender process control.
> 2) A standby should still be able to receive WAL even if it cannot
> satisfy causal reads to give it a chance to catch up faster the amount
> it is late.
>
> - SYNCHRONOUS_COMMIT_REMOTE_FLUSH /* wait for local and remote flush */
> + SYNCHRONOUS_COMMIT_REMOTE_FLUSH, /* wait for local and remote flush */
> + SYNCHRONOUS_COMMIT_REMOTE_APPLY, /* wait for local flush and remote
> + * apply */
> + SYNCHRONOUS_COMMIT_CONSISTENT_APPLY /* wait for local flusha and remote
> + apply with causal consistency */
> SYNCHRONOUS_COMMIT_CONSISTENT_APPLY is used nowhere, and there is a
> typo s/flusha/flush a/.
>
> I am still digging into the patch, the available/joining/unavailable
> logic being quite interesting.
> --
> Michael

--
Thomas Munro
http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Fabrízio de Royes Mello 2016-03-07 01:09:20 gzclose don't set properly the errno in pg_restore
Previous Message Tom Lane 2016-03-06 22:56:32 Re: character_not_in_repertoire vs. untranslatable_character