Re: max_standby_delay considered harmful

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_standby_delay considered harmful
Date: 2010-05-08 23:04:23
Message-ID: 4BE5EDF7.6030305@2ndquadrant.com
Lists: pgsql-hackers

Bruce Momjian wrote:
> I think the big question is whether this issue is significant enough
> that we should ignore our policy of no feature design during beta

The idea that you're considering removal of a feature that we already
have people using in beta and making plans around is a policy violation
too, you know. A freeze should include not cutting things just because
their UI or implementation is not ideal yet. And you've been using the
word "consensus" here when there is no such thing. At best there's
barely a majority here among people who have stated an opinion, and
consensus means something much stronger than that: something closer
to unanimity. I thought the summary of where the project stands that
Josh wrote at
http://archives.postgresql.org/message-id/4BE31279.7040002@agliodbs.com
was excellent, both from a technical and a process commentary
standpoint. I'd be completely happy to follow that plan, and then we'd
be at a consensus--with no one left arguing.

It was very clear back in February that if SR didn't hit the feature set
to make HS less troublesome out of the box, there would be some
limitations here, and that set of concerns hasn't changed much since
then. I thought the backup plan if we didn't get things like xid
feedback was to keep the capability as written anyway, knowing that it's
still much better than no control over cancellation timing available at
all. Keep improving documentation around its issues, and continue to
hack away at them in user space and in the field. Then we do better for
9.1. You seem bent on removing the feedback part of that cycle.

The full statement of the ESR bit Josh was quoting is "Release early.
Release often. And listen to your customers."[1] My customers include
some who believed in the PostgreSQL community process enough to
contribute toward the HS development that's been completed and donated
to the project. They have a pretty clear view on this, which I'm relaying
when I talk about what I'd like to see happen. They are saying they cannot
completely ignore their requirements for HA failover, but would be
willing to loosen them just a bit (increasing failover time slightly) if
it reduces the odds of query cancellation, and therefore improves how
much load they can expect to push toward the standby. max_standby_delay
is a currently available mechanism that does that. I'm not going to be
their nanny and say "no, that's not perfectly predictable, you might get
a query canceled sometimes when you don't expect it anyway".

Instead, I was hoping to let them deploy 9.0 with this option available
(but certainly not the default), informed of the potential risks, and see
how that goes. We can confirm whether the userland workarounds we
believe will be effective here really are. If so, then we can soldier
forward directly incorporating them into the server code, knowing that
they work. If not, switch to one of the safer modes, see if there's
something better to use altogether in 9.1, and perhaps this whole
approach gets removed. That's healthy development progress either way.

Upthread Bruce expressed some concern that this was going to live
forever once deployed. There is no way I'm going to let this behavior
continue to be available in 9.1 if field tests say the workarounds
aren't good enough. That's going to torture all of us who do customer
deployments of this technology every day if that turns out to be the
case, and nobody is going to feel the heat from that worse than
2ndQuadrant. I once did a round of removing GUCs that didn't do what
they were expected to in the field, based on real-world tests
showing regular misuse, and I'll do it again if this falls into that
same category. We've already exposed this release to a whole stack of
risk from work during its development cycle, risk that doesn't really
drop much just from cutting this one bit. I'd at least like to get all
the reward possible from that risk, which I expected to include feedback
in this area.

Circumventing the planned development process by dropping this now will
ruin how I expected the project to feel out the right thing on the user
side, and we'll all be left with little more insight for what to do in
9.1 than we have now. And I'm not looking forward to explaining to
people why a feature they've been seeing and planning to deploy for
months has now been cut only after what was supposed to be a freeze for
beta.

[1] http://catb.org/esr/writings/homesteading/cathedral-bazaar/ar01s04.html
This particular bit is quite relevant here: "Linus was keeping his
hacker/users constantly stimulated and rewarded—stimulated by the
prospect of having an ego-satisfying piece of the action, rewarded by
the sight of constant (even daily) improvement in their work. Linus was
directly aiming to maximize the number of person-hours thrown at
debugging and development, even at the possible cost of instability in
the code and user-base burnout if any serious bug proved intractable." I
continue to be disappointed at how contributing code to PostgreSQL is
far more likely to come with a dose of argument and frustration than
with reward, and this discussion is a perfect example of that.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us
