Re: Proposal: "Causal reads" mode for load balancing reads without stale data

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>, Thom Brown <thom(at)linux(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Proposal: "Causal reads" mode for load balancing reads without stale data
Date: 2016-04-04 16:17:34
Message-ID: CA+TgmoY9NWeKmmk+SDFYfX5XzzGnV4A52Uc0gi4A0-DZznn+4g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2016 at 2:22 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Wed, Mar 30, 2016 at 2:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> OK, I committed this, with a few tweaks. In particular, I added a
>> flag variable instead of relying on "latch set" == "need to send
>> reply"; the other changes were cosmetic.
>>
>> I'm not sure how much more of this we can realistically get into 9.6;
>> the latter patches haven't had much review yet. But I'll set this
>> back to Needs Review in the CommitFest and we'll see where we end up.
>> But even if we don't get anything more than this, it's still rather
>> nice: remote_apply turns out to be only slightly slower than remote
>> flush, and it's a guarantee that a lot of people are looking for.
>
> Thank you Michael and Robert!
>
> Please find attached the rest of the patch series, rebased against
> master. The goal of the 0002 patch is to provide an accurate
> indication of the current replay lag on each standby, visible to users
> like this:
>
> postgres=# select application_name, replay_lag from pg_stat_replication;
>  application_name │   replay_lag
> ──────────────────┼─────────────────
>  replica1         │ 00:00:00.000299
>  replica2         │ 00:00:00.000323
>  replica3         │ 00:00:00.000319
>  replica4         │ 00:00:00.000303
> (4 rows)
>
> It works by maintaining a buffer of (end of WAL, time now) samples
> received from the primary, and then eventually feeding those times
> back to the primary when the recovery process replays the
> corresponding locations.
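
A rough sketch of that mechanism (hypothetical names and sizes, not the
patch's actual code) might look like this on the standby side:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL location type */
typedef int64_t TimestampTz;    /* stand-in: microseconds since the PG epoch */

#define LAG_BUFFER_SIZE 8192

typedef struct
{
    XLogRecPtr  lsn;            /* end of WAL when the sample was taken */
    TimestampTz time;           /* primary's clock at that moment */
} WalTimeSample;

typedef struct
{
    WalTimeSample samples[LAG_BUFFER_SIZE];
    int         write_head;     /* where the next sample will be stored */
    int         read_head;      /* oldest sample not yet replayed */
} LagTracker;

/* Walreceiver: remember that WAL up to 'lsn' existed at primary time 'now'. */
static void
lag_sample_write(LagTracker *lt, XLogRecPtr lsn, TimestampTz now)
{
    int         next = (lt->write_head + 1) % LAG_BUFFER_SIZE;

    if (next == lt->read_head)
        return;                 /* buffer full; overflow policy discussed below */
    lt->samples[lt->write_head].lsn = lsn;
    lt->samples[lt->write_head].time = now;
    lt->write_head = next;
}

/*
 * Recovery side: after replaying up to 'replayed_lsn', find the newest
 * sample at or before that location so that its timestamp can be sent
 * back to the primary in the next reply message.
 */
static bool
lag_sample_read(LagTracker *lt, XLogRecPtr replayed_lsn, TimestampTz *time)
{
    bool        found = false;

    while (lt->read_head != lt->write_head &&
           lt->samples[lt->read_head].lsn <= replayed_lsn)
    {
        *time = lt->samples[lt->read_head].time;
        lt->read_head = (lt->read_head + 1) % LAG_BUFFER_SIZE;
        found = true;
    }
    return found;
}
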
>
> Compared to approaches based on commit timestamps, this approach has
> the advantage of providing non-misleading information between commits.
> For example, if you run a batch load job that takes 1 minute to insert
> the whole phonebook and no other transactions run, you will see
> replay_lag updating regularly throughout that minute, whereas typical
> commit timestamp-only approaches will show an increasing lag time
> until a commit record is eventually applied. Compared to simple LSN
> location comparisons, it reports in time rather than bytes of WAL,
> which can be more meaningful for DBAs.
>
> When the standby is entirely caught up and there is no write activity,
> the reported time effectively represents the ping time between the
> servers, and is updated every wal_sender_timeout / 2, when keepalive
> messages are sent. While new WAL traffic is arriving, the walreceiver
> records timestamps at most once per second in a circular buffer, and
> then sends back replies containing the recorded timestamps as fast as
> the recovery process can apply the corresponding xlog. The lag number
> you see is computed by the primary server comparing two timestamps
> generated by its own system clock, one of which has been on a journey
> to the standby and back.
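
So the lag arithmetic itself is just a subtraction of two readings of the
primary's clock; a minimal sketch, with hypothetical names:

/*
 * Minimal sketch of the lag computation, with hypothetical names.  Both
 * arguments come from the primary's own clock: 'sample_time' travelled to
 * the standby with the WAL and came back in a reply once that WAL had
 * been applied, so no cross-machine clock comparison is involved.
 */
static TimestampTz
replay_lag(TimestampTz sample_time, TimestampTz now_on_primary)
{
    return now_on_primary - sample_time;
}
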
>
> Accurate lag estimates are a prerequisite for the 0004 patch (about
> which more later), but I believe users would find this valuable as a
> feature on its own.

Well, one problem with this is that you can't put a loop inside of a
spinlock-protected critical section.
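
That is, a spinlock-protected section has to be a short, bounded stretch of
straight-line code; anything that needs to loop should copy the shared data
out first and work on the local copy. A generic sketch of that pattern (the
struct and field names here are made up, not taken from the patch):

#include "postgres.h"

#include <string.h>

#include "storage/spin.h"

/* Made-up shared structure, for illustration only. */
typedef struct
{
    slock_t     mutex;
    int         n_samples;
    int64       samples[64];
} SharedSamples;

static int64
sum_samples(SharedSamples *shared)
{
    int64       local[64];
    int64       sum = 0;
    int         n;
    int         i;

    /*
     * Hold the spinlock only long enough to copy the shared data out; the
     * loop then runs on the local copy, outside the critical section.
     * Looping over the shared array while holding the lock would keep
     * other backends spinning for an unbounded time.
     */
    SpinLockAcquire(&shared->mutex);
    n = shared->n_samples;
    memcpy(local, shared->samples, n * sizeof(int64));
    SpinLockRelease(&shared->mutex);

    for (i = 0; i < n; i++)
        sum += local[i];

    return sum;
}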

In general, I think this is a pretty reasonable way of attacking this
problem, but I'd say it's significantly under-commented. Where should
someone go to get a general overview of this mechanism? The answer is
not "at place XXX within the patch". (I think it might merit some
more extensive documentation, too, although I'm not exactly sure what
that should look like.)

When you overflow the buffer, you could thin it out in a smarter way,
like by throwing away every other entry instead of the oldest one. I
guess you'd need to be careful how you coded that, though, because
replaying an entry with a timestamp invalidates some of the saved
entries without formally throwing them out.
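
For what it's worth, the thinning step itself can be very simple; here is a
hypothetical sketch (reusing the WalTimeSample sketch above, and glossing
over the read/write-head bookkeeping a real circular buffer needs):

/*
 * Hypothetical sketch of thinning on overflow: keep every other sample,
 * halving the buffer's time resolution instead of discarding its oldest
 * history.  Assumes the samples have already been unwound into
 * chronological order; a real circular buffer would also have to adjust
 * its read and write heads, and skip entries already invalidated by
 * replay, as noted above.
 */
static int
thin_samples(WalTimeSample *samples, int n)
{
    int         keep = 0;
    int         i;

    for (i = 0; i < n; i += 2)
        samples[keep++] = samples[i];

    return keep;                /* new number of valid samples */
}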

Conceivably, 0002 could be split into two patches, one of which
computes "stupid replay lag" considering only records that naturally
carry timestamps, and a second adding the circular buffer to handle
the case where much time passes without finding such a record.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
