Re: [CORE] Restore-reliability mode

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, pgsql-core <pgsql-core(at)postgresql(dot)org>
Subject: Re: [CORE] Restore-reliability mode
Date: 2015-06-05 15:54:56
Message-ID: CANP8+jJd+7hncAmHUZCETxSPf0Ef9uKh227LBK4xSA9p1k0AYw@mail.gmail.com
Lists: pgsql-hackers

On 5 June 2015 at 16:05, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

>
> Please address some of the specific issues I mentioned.

I can discuss them but not because I am involved directly. I take
responsibility as a committer and have an interest from that perspective.

In my role at 2ndQuadrant, I approved all of the time Alvaro and Andres
spent on submitting, reviewing and fixing bugs - at this point that has
cost something close to fifty thousand dollars just on this feature and
subsequent actions. (I believe the feature was originally funded, but we
never saw a penny of that, though others did.)

> The problem
> with the multi-xact case is that we just kept fixing bugs as people
> found them, and did not do a holistic review of the code.

I observed much discussion and review. The bugs we've had have all been
fairly straightforwardly fixed. There haven't been any design-level
oversights or head-palm moments. It's complex software that had complex
behaviour that caused problems. The problem has been that anything on-disk
causes more problems when errors occur. We should review carefully anything
that alters the way on-disk structures work, like the WAL changes, UPSERT's
new mechanism, etc.

From my side, it is only recently that I got some clear answers to my questions
about how it worked. I think it is very important that major features have
extensive README type documentation with them so the underlying principles
used in the development are clear. I would define the measure of a good
feature as whether another committer can read the code comments and get a
good feel for it. A bad feature is one where committers walk away from it, saying
"I don't really get it and I can't read an explanation of why it does that."
Tom's most significant contribution is his long descriptive comments on
what the problem is that needs to be solved, the options and the method
chosen. Clarity of thought is what solves bugs.

Overall, I don't see the need to stop the normal release process and do a
holistic review. But I do think we should check each feature to see whether
it is fully documented or whether we are simply trusting one of us to be
around to fix it.

> I am just saying we need to ask the
> reliability question _first_.
>

Agreed

> Let me restate something that has appeared in many replies to my ideas
> --- I am not asking for infinite or unbounded review, but I am asking
> that we make sure reliability gets the proper focus in relation to our
> time pressures. Our balance was so off a month ago that I feel only a
> full stop on time pressure would allow us to refocus because people are
> not good at focusing on multiple things. It is sometimes necessary to
> stop everything to get people's attention, and to help them remember
> that without reliability, a database is useless.
>

Here, I think we are talking about different types of reliability.
PostgreSQL software is well ahead of most industry measures of quality;
these recent bugs have done nothing to damage that, other than that a few people
woke up and said "Wow! Postgres had a bug??!?!?". The presence of bugs is
common and if we have grown unused to them, we should be wary of that,
though not tolerant.

PostgreSQL is now reliable in the sense that we have many features that
ensure availability even in the face of software problems and bug-induced
corruption. Those have helped us get out of the current situations, giving
users a workaround while bugs are fixed. So the impact of database software
bugs is not what it once was.

Reliable delivery of new versions of software is important too. New
versions often contain new features that fix real world problems, just as
much as bug fixes do, which is why I don't wish to divert from the normal
process and schedule.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
