max_standby_delay considered harmful

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: max_standby_delay considered harmful
Date: 2010-05-03 15:37:04
Message-ID: 16681.1272901024@sss.pgh.pa.us
Lists: pgsql-hackers

I've finally wrapped my head around exactly what the max_standby_delay
code is doing, and I'm not happy with it. The way that code is designed
is that the setting represents a maximum allowed difference between the
standby server's system clock and the commit timestamps it is reading
from the WAL log; whenever this difference exceeds the setting, we'll
kill standby queries in hopes of catching up faster. Now, I can see
the attraction of defining it that way, for certain use-cases.
However, I think it is too fragile and too badly implemented to be
usable in the real world; and it certainly can't be the default
operating mode. There are three really fundamental problems with it:

1. The timestamps we are reading from the log might be historical,
if we are replaying from archive rather than reading a live SR stream.
In the current implementation that means zero grace period for standby
queries. Now if your only interest is catching up as fast as possible,
that could be a sane behavior, but this is clearly not the only possible
interest --- in fact, if that's all you care about, why did you allow
standby queries at all?

2. There could be clock skew between the master and slave servers.
If the master's clock is a minute or so ahead of the slave's, again we
get into a situation where standby queries have zero grace period, even
though killing them won't do a darn thing to permit catchup. If the
master is behind the slave then we have an artificially inflated grace
period, which is going to slow down the slave.

3. There could be significant propagation delay from master to slave,
if the WAL stream is being transmitted with pg_standby or some such.
Again this results in cutting into the standby queries' grace period,
for no defensible reason.

In addition to these fundamental problems there's a fatal implementation
problem: the actual comparison is not to the master's current clock
reading, but to the latest commit, abort, or checkpoint timestamp read
from the WAL. Thus, if the last commit was more than max_standby_delay
seconds ago, zero grace time. Now if the master is really idle then
there aren't going to be any conflicts anyway, but what if it's running
only long-running queries? Or what happens when it has been idle for a
while and then starts new queries? Zero grace period, that's what.
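
To make that failure mode concrete, here is a standalone C sketch of the
comparison as described above. It is illustrative only; the function and
variable names (should_cancel_conflicting_queries, last_wal_record_time)
are made up and this is not the actual backend code:

/*
 * Sketch of the 9.0-era rule described above: compare the standby's own
 * clock against the timestamp of the latest commit/abort/checkpoint
 * record seen in WAL, and cancel once the difference exceeds the limit.
 */
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

static bool
should_cancel_conflicting_queries(time_t last_wal_record_time,
                                  int max_standby_delay_secs)
{
    time_t  now = time(NULL);
    double  apparent_lag = difftime(now, last_wal_record_time);

    /* Cancel as soon as (standby clock - WAL timestamp) exceeds the limit. */
    return apparent_lag > max_standby_delay_secs;
}

int
main(void)
{
    /*
     * Master idle for ten minutes, then a new commit arrives: its WAL
     * timestamp is already far behind the standby's clock, so conflicting
     * queries get zero grace period even though the standby isn't behind.
     */
    time_t  stale_commit = time(NULL) - 600;

    printf("cancel immediately? %s\n",
           should_cancel_conflicting_queries(stale_commit, 30) ? "yes" : "no");
    return 0;
}

Run against a WAL timestamp that is ten minutes old, this cancels
immediately regardless of how recently the record actually arrived.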

We could possibly improve matters for the SR case by having walsender
transmit the master's current clock reading every so often (probably
once per activity cycle), outside the WAL stream proper. The receiver
could subtract off its own clock reading in order to measure the skew,
and then we could cancel queries if the de-skewed transmission time
falls too far behind. However this doesn't do anything to fix the cases
where we aren't reading (and caught up to) a live SR broadcast.
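
One possible reading of that scheme, again as a standalone sketch with
invented names (keepalive_signals_cancel, baseline_offset) rather than a
worked-out protocol: treat the smallest (receipt time - master clock)
offset ever observed as the skew-plus-minimum-transit baseline, and treat
growth beyond that baseline as real lag. The baseline-tracking detail is
my assumption about how the de-skewing might be done, not something
specified above:

#include <stdio.h>
#include <stdbool.h>
#include <time.h>

static double baseline_offset = 1e9;    /* min observed (receipt - send), secs */

static bool
keepalive_signals_cancel(time_t master_clock_in_msg,
                         int max_standby_delay_secs)
{
    time_t  receipt = time(NULL);
    double  offset = difftime(receipt, master_clock_in_msg);

    /*
     * The smallest offset ever seen approximates clock skew plus minimum
     * transit time; anything above that baseline is replication/apply lag.
     */
    if (offset < baseline_offset)
        baseline_offset = offset;

    return (offset - baseline_offset) > max_standby_delay_secs;
}

int
main(void)
{
    /* First keepalive establishes the baseline ... */
    keepalive_signals_cancel(time(NULL) - 2, 30);

    /* ... a later one arriving 60 seconds "late" would trigger cancellation. */
    printf("cancel? %s\n",
           keepalive_signals_cancel(time(NULL) - 62, 30) ? "yes" : "no");
    return 0;
}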

I'm inclined to think that we should throw away all this logic and just
have the slave cancel competing queries if the replay process waits
more than max_standby_delay seconds to acquire a lock. This is simple,
understandable, and behaves the same whether we're reading live data or
not. Putting in something that tries to maintain a closed-loop maximum
delay between master and slave seems like a topic for future research
rather than a feature we have to have in 9.0. And in any case we'd
still want the plain max delay for non-SR cases, AFAICS, because there's
no sane way to use closed-loop logic in other cases.
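
A minimal sketch of that simpler rule, with stub functions standing in for
the real machinery (try_acquire_lock and cancel_conflicting_queries are
placeholders, not actual backend APIs): the replay process waits up to
max_standby_delay seconds, measured purely on the standby's own clock, and
then cancels whatever is blocking it.

#include <stdio.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for "can the startup process get the lock it needs?" */
static bool
try_acquire_lock(void)
{
    return false;               /* pretend a standby query holds it */
}

static void
cancel_conflicting_queries(void)
{
    printf("cancelling queries that block WAL replay\n");
}

/*
 * Wait up to max_standby_delay seconds on the standby's own clock; if the
 * lock still can't be had, cancel the conflicting queries.  No master
 * timestamps, no skew, no dependence on live vs. archived WAL.
 */
static void
replay_acquire_lock(int max_standby_delay_secs)
{
    time_t  deadline = time(NULL) + max_standby_delay_secs;

    while (!try_acquire_lock())
    {
        if (time(NULL) >= deadline)
        {
            cancel_conflicting_queries();
            break;
        }
        sleep(1);               /* retry until the lock frees up */
    }
}

int
main(void)
{
    replay_acquire_lock(3);
    return 0;
}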

Comments?

regards, tom lane
