Re: [CORE] Restore-reliability mode

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, pgsql-core <pgsql-core(at)postgresql(dot)org>
Subject: Re: [CORE] Restore-reliability mode
Date: 2015-06-05 15:05:14
Message-ID: 20150605150514.GA25537@momjian.us
Lists: pgsql-hackers

On Fri, Jun 5, 2015 at 07:50:31AM +0100, Simon Riggs wrote:
> On 3 June 2015 at 18:21, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> I would argue that if we delay 9.5 in order to do a 100% manual review
>> of code, without adding any new automated tests or other non-manual
>> tools for improving stability, then it's a waste of time; we might as
>> well just release the beta, and our users will find more issues than we
>> will.  I am concerned that if we declare a cleanup period, especially in
>> the middle of the summer, all that will happen is that the project will
>> go to sleep for an extra three months.
>
> Agreed. Cleanup can occur while we release code for public testing.
>
> Many eyeballs of Beta beats anything we can throw at it through manual inspection.
> The whole problem of bugs is that they are mostly found by people trying to use
> the software. 

Please address some of the specific issues I mentioned. The problem
with the multi-xact case is that we just kept fixing bugs as people
found them, and did not do a holistic review of the code. I am saying
let's not _keep_ doing that and let's make sure we don't have any
systematic problems in our code where we just keep fixing things without
doing a thorough analysis.

To release 9.5 beta would be to get back into that cycle, and I am not
sure we are ready for that. I think the fact we have multiple people
all reviewing the multi-xact code now (and not dealing with 9.5) is a
good sign. If we were focused on 9.5 beta, I doubt this would have
happened.

I am saying let's make sure we are not deficient in other areas, then
let's move forward again. I would love to think we can do multiple
things at once, but for multi-xact, serious review didn't happen for 18
months, so if slowing release development is what is required, I support
it.

> We've decided previously that having a fixed annual schedule was a good thing
> for the project. Getting the features that work into the hands of the people
> that want them is very important.

Yes, but let's not be a slave to the schedule if our reliability is
suffering, which it clearly has in the past 18 months.

> Discussing halting the development schedule publicly is very damaging. 

Agreed.

> If there are features in doubt, let's do more work on them or just pull them now
> and return to the schedule. I don't really care which ones get canned as long
> as we return to the schedule.

Again, please address my concerns above. This is not about 9.5
features, but rather our overall focus on schedule vs. reliability, and
your arguments are reinforcing my idea that we do not have the proper
balance here.

> Whatever we do must be exact and measurable. If it's not, it means we haven't
> assembled enough evidence for action that is sufficiently directed to achieve
> the desired goal.

Sure. I think everyone agrees the multi-xact work is all good, so I am
asking what else needs this kind of research. If there is nothing else,
we can move forward again --- I am just saying we need to ask the
reliability question _first_.

Let me restate something that has appeared in many replies to my ideas
--- I am not asking for infinite or unbounded review, but I am asking
that we make sure reliability gets the proper focus in relation to our
time pressures. Our balance was so off a month ago that I feel only a
full stop on time pressure would allow us to refocus because people are
not good at focusing on multiple things. It is sometimes necessary to
stop everything to get people's attention, and to help them remember
that without reliability, a database is useless.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
