Re: Causal reads take II

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Causal reads take II
Date: 2017-07-31 05:49:34
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a
> few
> tests. I didn't manage to break anything, except one mysterious error that
> I've
> got only once on one of my replicas, but I couldn't reproduce it yet.
> Interesting thing is that this error did not affect another replica or
> primary.
> Just in case here is the log for this error (maybe you can see something
> obvious, that I've not noticed):
> ```
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
> Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211":
> Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
> Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211":
> Directory not empty
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> LOG: directories for tablespace 47733 could not be removed
> HINT: You can remove the directories manually if necessary.
> CONTEXT: WAL redo at 0/125F4D90 for Tablespace/DROP: 47733
> FATAL: could not create directory "pg_tblspc/47734/PG_10_201707211/47732":
> File exists
> CONTEXT: WAL redo at 0/125F5768 for Storage/CREATE:
> pg_tblspc/47734/PG_10_201707211/47732/47736
> LOG: startup process (PID 8034) exited with exit code 1
> LOG: terminating any other active server processes
> LOG: database system is shut down
> ```

Hmm. The first error ("could not remove directory") could perhaps be
explained by temporary files from concurrent backends, leaked files
from earlier crashes or copying a pgdata directory over the top of an
existing one as a way to set it up, leaving behind some files from an
earlier test? The second error ("could not create directory") is a
bit stranger though... I think this must come from
TablespaceCreateDbspace(): it must have stat()'d the file and got
ENOENT, decided to create the directory, acquired
TablespaceCreateLock, stat()'d the file again and found it still
absent, then run mkdir() on the parents and got EEXIST and finally on
the directory to be created, and surprisingly got EEXIST. That means
that someone must have concurrently created the directory. Perhaps in
your testing you accidentally copied a pgdata directory over the top
of it while it was running? In any case I'm struggling to see how
anything in this patch would affect anything at the REDO level.

> And speaking about the code, so far I have just a few notes (some of them
> merely questions):
> * In general the idea behind this patch sounds interesting for me, but it
> relies heavily on time synchronization. As mentioned in the documentation:
> "Current hardware clocks, NTP implementations and public time servers are
> unlikely to allow the system clocks to differ more than tens or hundreds
> of
> milliseconds, and systems synchronized with dedicated local time servers
> may
> be considerably more accurate." But as far as I remember from my own
> experience sometimes it maybe not that trivial on something like AWS
> because
> of virtualization. Maybe it's an unreasonable fear, but is it possible to
> address this problem somehow?

Oops, I had managed to lose an important hunk that deals with
detecting excessive clock drift (ie badly configured servers) while
rebasing a couple of versions back. Here is a version to put it back.

With that change, if you disable NTP and manually set your standby's
clock to be more than 1.25s (assuming synchronous_replay_lease_time is
set to the default of 5s) behind the primary, the synchronous_replay
should be unavailable and you should see this error in the standby's

(errmsg("the primary server's clock time is too
far ahead for synchronous_replay"),
errhint("Check your servers' NTP configuration or

One way to test this without messing with your NTP setting or
involving two different computers is to modify this code temporarily
in WalSndKeepalive:

now = GetCurrentTimestamp() + 1250100;

This is a best effort intended to detect a system not running ntpd at
all or talking to an insane time server. Fundamentally this proposal
is based on the assumption that you can get your system clocks into
sync within a tolerance that we feel confident estimating an upper
bound for.

It does appear that some Amazon OS images come with NTP disabled;
that's a problem if you want to use this feature, but if you're
running a virtual server without an ntpd you'll pretty soon drift
seconds to minutes off UTC time and get "unavailable for synchronous
replay" errors from this patch (and possibly the LOG message above,
depending on the direction of drift).

> * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> time shift when we're dealing with leases). Does it make sense to move
> them
> out and make them configurable?

These numbers are interrelated, and I think they're best fixed in that
ratio. You could make it more adjustable, but I think it's better to
keep it simple with just a single knob. Let me restate that logic to
explain how I came up with those ratios. There are two goals:

1. The primary needs to be able to wait for a lease to expire if
connectivity is lost, even if the the standby's clock is behind the
primary's clock by max_clock_skew. (Failure to do so could allow
stale query results from a zombie standby that is somehow still
handling queries after connectivity with the primary is lost.)

2. The primary needs to be able to replace leases often enough so
that there are no gaps between them, even if the standby's clock is
ahead of the primary's clock by max_clock_skew. (Failure to replace
leases fast enough could cause spurious "unavailable" errors, but not
incorrect query results. Technically it's max_clock_skew - network
latency since it takes time for new leases to arrive but that's a
minor detail).

A solution that maximises tolerable clock skew and as an added bonus
doesn't require the standby to have access to the primary's GUCs is to
tell the standby that the expiry time is 25% earlier that the time the
primary will really wait until, and replace leases when they still
have 50% of their time to go.

To illustrate using fixed-width ASCII-art, here's how the primary
perceives the stream of leases it sends. In this diagram, '+' marks
halfway, '!' marks the time the primary will send to the standby as
the expiry time, and the final '|' marks the time the primary will
really wait until if it has to, just in case the standby's clock is
'slow' (behind). The '!' is 25% earlier, and represents the maximum
tolerable clock skew.


Here's how a standby with a clock that is 'slow' (behind) by
max_clock_skew = 25% perceives this stream of leases:


You can see that the primary server is able to wait just long enough
for the lease to expire and the error to begin to be raised on this
standby server, if it needs to.

Here's how a standby with a clock that is 'fast' (ahead) by
max_clock_skew = 25% perceives this stream of leases:


If it's ahead by more than that, we'll get gaps where the error may be
raised spuriously in between leases.

> * Judging from the `SyncReplayPotentialStandby` function, it's possible to
> have
> `synchronous_replay_standby_names = "*, server_name"`, which is basically
> an
> equivalent for just `*`, but it looks confusing. Is it worth it to prevent
> this behaviour?

Hmm. Seems harmless to me!

> * In the same function `SyncReplayPotentialStandby` there is this code:
> ```
> if (!SplitIdentifierString(rawstring, ',', &elemlist))
> {
> /* syntax error in list */
> pfree(rawstring);
> list_free(elemlist);
> /* GUC machinery will have already complained - no need to do again */
> return false;
> }
> ```
> Am I right that ideally this (a situation when at this point in the code
> `synchronous_replay_standby_names` has incorrect value) should not happen,
> because GUC will prevent us from that? If yes, then it looks for me that
> it
> still makes sense to put here a log message, just to give more information
> in
> a potentially weird situation.

Yes. That's exactly the coding that was used for synchronous_commit,
before it was upgraded to support a new fancy syntax. I was trying to
do things the established way.

> * In the function `SyncRepReleaseWaiters` there is a commentary:
> ```
> /*
> * If the number of sync standbys is less than requested or we aren't
> * managing a sync standby or a standby in synchronous replay state that
> * blocks then just leave.
> * /
> if ((!got_recptr || !am_sync) && !walsender_sr_blocker)
> ```
> Is this commentary correct? If I understand everything right `!got_recptr`
> -
> the number of sync standbys is less than requested (a), `!am_sync` - we
> aren't
> managing a sync standby (b), `walsender_sr_blocker` - a standby in
> synchronous
> replay state that blocks (c). Looks like condition is `(a or b) and not
> c`.

This code is trying to decide whether to leave early, rather than
potentially blocking. The change in my patch is:

- if (!got_recptr || !am_sync)
+ if ((!got_recptr || !am_sync) && !walsender_sr_blocker)

The old coding said "if there aren't enough sync commit standbys, or
I'm not a sync standby, then I can leave now". The coding with my
patch is the same, except that in any case it won't leave early this
walsender is managing a standby that potentially blocks commit. That
said, it's a terribly named and documented function argument, so I
have fixed that in the attached version; I hope that's better!

To put it another way, with this patch there are two different reasons
a transaction might need to wait: because of synchronous_commit and
because of synchronous_replay. They're both forms of 'synchronous
replication' I suppose. That if statement is saying 'if I don't need
to wait for synchronous_commit, and I don't need to wait for
synchronous_replay, then we can return early'.

> * In the function `ProcessStandbyReplyMessage` there is a code that
> implements
> this:
> ```
> * If the standby reports that it has fully replayed the WAL for at
> least
> * 10 seconds, then let's clear the lag times that were measured when it
> * last wrote/flushed/applied a WAL record. This way we avoid displaying
> * stale lag data until more WAL traffic arrives.
> ```
> but I never found any mention of this 10 seconds in the documentation. Is
> it
> not that important? Also, question 2 is related to this one.

Hmm. Yeah that does seem a bit arbitrary. The documentation in
master does already say that it's cleared without being saying exactly

[...] If the standby
server has entirely caught up with the sending server and there is no more
WAL activity, the most recently measured lag times will continue to be
displayed for a short time and then show NULL.

The v1 patch changeed it from being based on
wal_receiver_status_interval (sort of implicitly) to being 10 seconds,
and yeah that is not a good change. In this v2 it's using
wal_receiver_status_interval (though it has to do it explicitly, since
this patch increases the amount of chit chat between the servers due
to lease replacement).

> * In the function `WalSndGetSyncReplayStateString` all the states are in
> lower
> case except `UNKNOWN`, is there any particular reason for that?

This state should never exist, so no user will ever see it; I used
"UNKNOWN" following the convention established by the function
WalSndGetStateString() immediately above.

> There are also few more not that important notes (mostly about some typos
> and
> few confusing names), but I'm going to do another round of review and
> testing
> anyway so I'll just send them all next time.

Thanks for the review!

Thomas Munro

Attachment Content-Type Size
synchronous-replay-v2.patch application/octet-stream 78.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2017-07-31 06:08:58 Persistent wait event sets and socket changes
Previous Message Ashutosh Bapat 2017-07-31 05:49:00 Re: Adding support for Default partition in partitioning