Re: max_standby_delay considered harmful

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_standby_delay considered harmful
Date: 2010-05-04 19:50:31
Message-ID: 1273002631.4535.2900.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2010-05-04 at 11:27 -0700, Josh Berkus wrote:

> I still don't see how that works.
...

The good news is we agree by the time we get to the bottom... ;-)

> I'm more interested in your assertion that there's a lot in the
> replication stream which doesn't take a lock; if that's the case, then
> implementing any part of Tom's proposal is hopeless.

(No, still valid, the idea is generic)

> > * standby query delay - defined as the time that recovery will wait for
> > a query to complete before a cancellation takes place. (We could
> > complicate this by asking what happens when recovery is blocked twice by
> > the same query? Would it wait twice, or does it have to track how much
> > it has waited for each query in total so far?)
>
> Aha! Now I see the confusion.

BTW, Tom's proposal was approx half a sentence long so that is the
source of any confusion.

> AFAIK, Tom was proposing that the
> pending recovery data would wait for max_standby_delay, total, then
> cancel *all* queries which conflicted with it. Now that we've talked
> this out, though, I can see that this can still result in "mass cancel"
> issues, just like the current max_standby_delay. The main advantage I
> can see to Tom's idea is that (presumably) it can be more discriminating
> about which queries it cancels.

As I said to Stephen, this is exactly how it works already and wasn't
what was proposed.

> I agree that waiting on *each* query for "up to # time" would be a
> completely different behavior, and as such, should be a option for DBAs.
> We might make it the default option, but we wouldn't make it the only
> option.

Glad to hear you say that.

> Speaking of which, was *your* more discriminating query cancel ever applied?
>
> > Currently max_standby_delay seeks to constrain the standby lag to a
> > particular value, as a way of providing a bounded time for failover, and
> > also to constrain the amount of WAL that needs to be stored as the lag
> > increases. Currently, there is no guaranteed minimum query delay given
> > to each query.
>
> Yeah, I can just see a lot of combinational issues with this. For
> example, what if the user's network changes in some way to retard
> delivery of log segments to the point where the delivery time is longer
> than max_standby_delay? To say nothing about system clock synch, which
> isn't perfect even if you have it set up.
>
> I can see DBAs who are very focussed on HA wanting a standby-lag based
> control anyway, when HA is far more important than the ability to run
> queries on the slave. But I don't that that is the largest group; I
> think that far more people will want to balance the two considerations.
>
> Ultimately, as you say, we would like to have all three knobs:
>
> standby lag: max time measured from master timestamp to slave timestamp
>
> application lag: max time measured from local receipt of WAL records
> (via log copy or recovery connection) to their application

> query lag: max time any query which is blocking a recovery operation can run
>
> These three, in combination, would let us cover most potential use
> cases. So I think you've assessed that's where we're going in the
> 9.1-9.2 timeframe.
>
> However, I'd say for 9.0 that "application lag" is the least confusing
> option and the least dependant on the DBA's server room setup. So if we
> can only have one of these for 9.0 (and I think going out with more than
> one might be too complex, especially at this late date) I think that's
> the way to go.

Before you posted, I submitted a patch on this thread to redefine
max_standby_delay to depend upon the "application lag", as you've newly
defined it here - though obviously I didn't call it that. That solves
Tom's 3 issues. max_apply_delay might be technically more accurate term,
though isn't sufficiently better parameter name as to be worth the
change.

That patch doesn't implement his proposal, but that can be done as well
as (though IMHO not instead of). Given that two people have already
misunderstood what Tom proposed, and various people are saying we need
only one, I'm getting less inclined to have that at all.

--
Simon Riggs www.2ndQuadrant.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2010-05-04 20:06:58 Re: Pause/Resume feature for Hot Standby
Previous Message Robert Haas 2010-05-04 19:46:47 Re: including PID or backend ID in relpath of temp rels