Re: Proposal: "Causal reads" mode for load balancing reads without stale data

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-04-05 23:21:46
Message-ID: CAEepm=31yndQ7S5RdGofoGz1yQ-cteMrePR2JLf9gWGzxKcV7w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 5, 2016 at 4:17 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> OK, I committed this, with a few tweaks. In particular, I added a
>>> flag variable instead of relying on "latch set" == "need to send
>>> reply"; the other changes were cosmetic.
>>>
>>> I'm not sure how much more of this we can realistically get into 9.6;
>>> the latter patches haven't had much review yet. But I'll set this
>>> back to Needs Review in the CommitFest and we'll see where we end up.
>>> But even if we don't get anything more than this, it's still rather
>>> nice: remote_apply turns out to be only slightly slower than remote
>>> flush, and it's a guarantee that a lot of people are looking for.
>>
>> Thank you Michael and Robert!
>>
>> Please find attached the rest of the patch series, rebased against
>> master. The goal of the 0002 patch is to provide an accurate
>> indication of the current replay lag on each standby, visible to users
>> like this:
>>
>> postgres=# select application_name, replay_lag from pg_stat_replication;
>>  application_name │   replay_lag
>> ──────────────────┼─────────────────
>>  replica1         │ 00:00:00.000299
>>  replica2         │ 00:00:00.000323
>>  replica3         │ 00:00:00.000319
>>  replica4         │ 00:00:00.000303
>> (4 rows)
>>
>> It works by maintaining a buffer of (end of WAL, time now) samples
>> received from the primary, and then eventually feeding those times
>> back to the primary when the recovery process replays the
>> corresponding locations.
>>
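
(To make that concrete, here is a rough sketch in C of the standby-side
sample buffer.  The names LagSample, lag_buffer, record_lag_sample,
pop_replayed_sample and LAG_BUFFER_SIZE are invented for this sketch and
are not the identifiers used in the patch.)

    #include "access/xlogdefs.h"     /* XLogRecPtr */
    #include "datatype/timestamp.h"  /* TimestampTz */

    #define LAG_BUFFER_SIZE 8192

    typedef struct LagSample
    {
        XLogRecPtr  lsn;        /* end-of-WAL position reported by the primary */
        TimestampTz sent_time;  /* primary's clock when that position was sent */
    } LagSample;

    static LagSample lag_buffer[LAG_BUFFER_SIZE];
    static int  lag_head = 0;   /* next slot to write */
    static int  lag_tail = 0;   /* oldest sample not yet replayed */

    /* walreceiver side: remember when the primary generated this much WAL */
    static void
    record_lag_sample(XLogRecPtr end_of_wal, TimestampTz primary_now)
    {
        /* overflow handling omitted here; see the discussion below */
        lag_buffer[lag_head].lsn = end_of_wal;
        lag_buffer[lag_head].sent_time = primary_now;
        lag_head = (lag_head + 1) % LAG_BUFFER_SIZE;
    }

    /* reply side: find the newest sample whose LSN has now been replayed */
    static bool
    pop_replayed_sample(XLogRecPtr replayed_upto, TimestampTz *sent_time)
    {
        bool    found = false;

        while (lag_tail != lag_head && lag_buffer[lag_tail].lsn <= replayed_upto)
        {
            *sent_time = lag_buffer[lag_tail].sent_time;
            lag_tail = (lag_tail + 1) % LAG_BUFFER_SIZE;
            found = true;
        }
        return found;
    }
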
>> Compared to approaches based on commit timestamps, this approach has
>> the advantage of providing non-misleading information between commits.
>> For example, if you run a batch load job that takes 1 minute to insert
>> the whole phonebook and no other transactions run, you will see
>> replay_lag updating regularly throughout that minute, whereas typical
>> commit timestamp-only approaches will show an increasing lag time
>> until a commit record is eventually applied. Compared to simple LSN
>> location comparisons, it reports in time rather than bytes of WAL,
>> which can be more meaningful for DBAs.
>>
>> When the standby is entirely caught up and there is no write activity,
>> the reported time effectively represents the ping time between the
>> servers, and is updated every wal_sender_timeout / 2, when keepalive
>> messages are sent. While new WAL traffic is arriving, the walreceiver
>> records timestamps at most once per second in a circular buffer, and
>> then sends back replies containing the recorded timestamps as fast as
>> the recovery process can apply the corresponding xlog. The lag number
>> you see is computed by the primary server comparing two timestamps
>> generated by its own system clock, one of which has been on a journey
>> to the standby and back.
>>
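
(Again for illustration only, a sketch of the primary-side arithmetic,
assuming the standby's reply simply echoes back the timestamp the
primary attached to the corresponding sample.  GetCurrentTimestamp()
and TimestampDifference() are existing server functions; the rest is
invented for the sketch.)

    #include "utils/timestamp.h"

    /*
     * walsender side: a reply says that everything up to some LSN has
     * been applied, and echoes the send-time timestamp recorded for it.
     */
    static void
    update_replay_lag(TimestampTz echoed_send_time)
    {
        TimestampTz now = GetCurrentTimestamp();
        long        secs;
        int         usecs;

        /* both timestamps come from the primary's own system clock */
        TimestampDifference(echoed_send_time, now, &secs, &usecs);

        /* store (secs, usecs) where pg_stat_replication.replay_lag reads it */
    }
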
>> Accurate lag estimates are a prerequisite for the 0004 patch (about
>> which more later), but I believe users would find this valuable as a
>> feature on its own.
>
> Well, one problem with this is that you can't put a loop inside of a
> spinlock-protected critical section.
>
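
(Right.  The usual shape of the fix, sketched here with invented names:
copy the shared fields you need inside the short spinlock section and do
any looping afterwards with no lock held.  SpinLockAcquire and
SpinLockRelease are the existing primitives from storage/spin.h;
SharedProgress and read_apply_position are made up for this sketch.)

    #include "access/xlogdefs.h"
    #include "storage/spin.h"

    typedef struct SharedProgress
    {
        slock_t     mutex;
        XLogRecPtr  apply;          /* last position known to be applied */
    } SharedProgress;

    static XLogRecPtr
    read_apply_position(SharedProgress *shared)
    {
        XLogRecPtr  apply;

        SpinLockAcquire(&shared->mutex);
        apply = shared->apply;      /* short, straight-line critical section */
        SpinLockRelease(&shared->mutex);

        return apply;               /* loop over the sample buffer outside the lock */
    }
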
> In general, I think this is a pretty reasonable way of attacking this
> problem, but I'd say it's significantly under-commented. Where should
> someone go to get a general overview of this mechanism? The answer is
> not "at place XXX within the patch". (I think it might merit some
> more extensive documentation, too, although I'm not exactly sure what
> that should look like.)
>
> When you overflow the buffer, you could thin it out in a smarter way,
> like by throwing away every other entry instead of the oldest one. I
> guess you'd need to be careful how you coded that, though, because
> replaying an entry with a timestamp invalidates some of the saved
> entries without formally throwing them out.
>
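
(A sketch of that idea, reusing the invented buffer layout from the
earlier sketch: when the buffer fills, drop every other sample so the
whole time range is still covered, just at half the resolution.)

    static void
    thin_lag_buffer(void)
    {
        int     src = lag_tail;
        int     dst = lag_tail;
        int     i = 0;

        while (src != lag_head)
        {
            if (i++ % 2 == 0)       /* keep every other sample */
            {
                lag_buffer[dst] = lag_buffer[src];
                dst = (dst + 1) % LAG_BUFFER_SIZE;
            }
            src = (src + 1) % LAG_BUFFER_SIZE;
        }
        lag_head = dst;
    }
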
> Conceivably, 0002 could be split into two patches, one of which
> computes "stupid replay lag" considering only records that naturally
> carry timestamps, and a second adding the circular buffer to handle
> the case where much time passes without finding such a record.

Thanks. I see a way to move that loop and change the overflow
behaviour along those lines, but due to other commitments I won't be
able to post a well-tested patch and still leave time for reviewers and
a committer before the looming deadline. After the freeze I will post
an updated version that addresses these problems for the next CF.

--
Thomas Munro
http://www.enterprisedb.com
