Re: Restore-reliability mode

From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, pgsql-core <pgsql-core(at)postgresql(dot)org>
Subject: Re: Restore-reliability mode
Date: 2015-06-06 19:58:05
Message-ID: 20150606195805.GA118899@tornado.leadboat.com
Lists: pgsql-hackers

On Fri, Jun 05, 2015 at 08:25:34AM +0100, Simon Riggs wrote:
> This whole idea of "feature development" vs reliability is bogus. It
> implies people that work on features don't care about reliability. Given
> the fact that many of the features are actually about increasing database
> reliability in the event of crashes and corruptions it just makes no sense.

I'm contrasting work that helps to keep our existing promises ("reliability")
with work that makes new promises ("features"). In software development, we
invariably hazard old promises to make new promises; our success hinges on
electing neither too little nor too much risk. Two years ago, PostgreSQL's
track record had placed it in a good position to invest in new, high-risk,
high-reward promises. We did that, and we emerged solvent yet carrying an
elevated debt service ratio. It's time to reduce risk somewhat.

You write about a different sense of "reliability." (Had I anticipated this
misunderstanding, I might have written "Restore-probity mode.") None of this
was about classifying people, most of whom allocate substantial time to each
kind of work.

> How will we participate in cleanup efforts? How do we know when something
> has been "cleaned up", how will we measure our success or failure? I think
> we should be clear that wasting N months on cleanup can *fail* to achieve a
> useful objective. Without a clear plan it almost certainly will do so. The
> flip side is that wasting N months will cause great amusement and dancing
> amongst those people who wish to pull ahead of our open source project and
> we should take care not to hand them a victory from an overreaction.

I agree with all that. We should likewise take care not to become insolvent
from an underreaction.

> So lets do our normal things, not do a "total stop" for an indefinite
> period. If someone has specific things that in their opinion need to be
> addressed, list them and we can talk about doing them, together.

I recommend these four exit criteria:

1. Non-author committer review of foreign key locks/multixact durability.
Done when that committer certifies, as if he were committing the patch
himself today, that the code will not eat data.

2. Non-author committer review of row-level security. Done when that
committer certifies that the code keeps its promises and that the
documentation bounds those promises accurately.

3. Second committer review of the src/backend/access changes for INSERT ... ON
CONFLICT DO NOTHING/UPDATE. (Bugs affecting folks who don't use the new
syntax are most likely to fall in that portion.) Unlike the previous two
criteria, a review without certification is sufficient.

4. Non-author committer certifying that the 9.5 WAL format changes will not
eat your data. The patch lists Andres and Alvaro as reviewers; if they
already reviewed it enough to make that certification, this one is easy.

That ties up four people. For everyone else:

- Fix bugs those reviews find. This will start slow but will grow to keep
everyone busy. Committers won't certify code, and thus we can't declare
victory, until these bugs are fixed. The rest of this list, in contrast,
calls out topics to sample from, not topics to exhaust.

- Turn current buildfarm members green.

- Write, review and commit more automated test machinery to PostgreSQL. Test
whatever excites you. If you need ideas, Craig posted some good ones
upthread. Here are a few more:
  - Add a debug mode that calls sched_yield() in SpinLockRelease(); see
    6322.1406219591@sss.pgh.pa.us.
  - Improve TAP suite (src/test/perl/TestLib.pm) logging. Currently, these
    suites redirect much output to /dev/null. Instead, log that output and
    teach the buildfarm to capture the log.
  - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
    count falls to zero. Under CLOBBER_FREED_MEMORY, wipe a shared buffer
    when its global pin count falls to zero.
  - With assertions enabled, or perhaps in a new debug mode, have
    pg_do_encoding_conversion() and pg_server_to_any() check the data for a
    no-op conversion instead of assuming the data is valid.

- Add buildfarm members. This entails reporting any bugs that prevent an
initial passing run. Once you have a passing run, schedule regular runs.
Examples of useful additions:
  - "./configure ac_cv_func_getopt_long=no, ac_cv_func_snprintf=no ..." to
    enable all the replacement code regardless of the current platform's need
    for it. This helps distinguish "Windows bug" from "replacement code bug."
  - --disable-integer-datetimes, --disable-float8-byval, --disable-float4-byval,
    --disable-spinlocks, --disable-atomics, --disable-thread-safety,
    --disable-largefile, #define RANDOMIZE_ALLOCATED_MEMORY
  - Any OS or CPU architecture other than x86 GNU/Linux, even ones already
    represented.

- Write, review and commit fixes for the bugs that come to light by way of
these new automated tests.

- Anything else targeted to make PostgreSQL keep the promises it has already
made to our users.
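
For illustration only, here is a minimal sketch of the sched_yield() debug mode suggested in the test-machinery list above. It uses a toy C11 spinlock, not PostgreSQL's s_lock.h code. The names toy_spinlock, toy_spin_*, toy_yield_count and DEBUG_YIELD_ON_SPIN_RELEASE are hypothetical, and yielding after the lock is dropped (rather than before) is an assumption about what would best provoke context switches at lock boundaries.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical build-time switch; PostgreSQL defines no macro of this name. */
#ifndef DEBUG_YIELD_ON_SPIN_RELEASE
#define DEBUG_YIELD_ON_SPIN_RELEASE 1
#endif

/* Toy spinlock: true while held. Stands in for slock_t in this sketch. */
typedef struct
{
    atomic_bool locked;
} toy_spinlock;

/* Counts debug yields so a test can observe that the mode is active. */
static atomic_ulong toy_yield_count = 0;

static void
toy_spin_init(toy_spinlock *lock)
{
    atomic_init(&lock->locked, false);
}

/* Returns true if the lock was free and is now held by the caller. */
static bool
toy_spin_try_acquire(toy_spinlock *lock)
{
    return !atomic_exchange_explicit(&lock->locked, true,
                                     memory_order_acquire);
}

/* Busy-wait until the lock is obtained. */
static void
toy_spin_acquire(toy_spinlock *lock)
{
    while (!toy_spin_try_acquire(lock))
        ;
}

static void
toy_spin_release(toy_spinlock *lock)
{
    /* Releasing a lock that is not held indicates a caller bug. */
    assert(atomic_load_explicit(&lock->locked, memory_order_relaxed));
    atomic_store_explicit(&lock->locked, false, memory_order_release);

#if DEBUG_YIELD_ON_SPIN_RELEASE
    /*
     * Debug mode: offer the CPU to other runnable processes right after
     * the lock is dropped, making otherwise rare interleavings more likely.
     */
    (void) sched_yield();
    atomic_fetch_add(&toy_yield_count, 1);
#endif
}
```

Compiling with -DDEBUG_YIELD_ON_SPIN_RELEASE=0 removes the yield, so a switch of this kind would cost nothing in ordinary builds.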
