Re: RFC: Add 'taint' field to pg_control.

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: RFC: Add 'taint' field to pg_control.
Date: 2018-03-01 01:03:30
Message-ID: CAMsr+YGJqHDP=HkLxAukhVz0R56MTfEj1++t8M-AWb+xFTwZqA@mail.gmail.com
Lists: pgsql-hackers

On 1 March 2018 at 05:43, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> a significant number of times during investigations of bugs I wondered
> whether running the cluster with various settings, or various tools
> could've caused the issue at hand. Therefore I'd like to propose adding
> a 'tainted' field to pg_control, that contains some of the "history" of
> the cluster. Individual bits inside that field that I can think of right
> now are:
> - pg_resetxlog was used non-passively on cluster
> - ran with fsync=off
> - ran with full_page_writes=off
> - pg_upgrade was used
>
> What do others think?
>
>
A huge +1 from me for the idea. I can't even count the number of black box
"WTF did you DO?!?" servers I've looked at, where bizarre behaviour has
turned out to be down to the user doing something very silly and not saying
anything about it.

It's only some flags, so putting it in pg_control is arguably somewhat
wasteful but so minor as to be of no real concern. And that's probably the
best way to make sure it follows the cluster around no matter what
backup/restore/copy mechanisms are used and how "clever" they try to be.

What I'd _really_ love would be to blow the scope of this up a bit and turn
it into a key-events cluster journal, recording key param switches,
recoveries (and LSN ranges), pg_upgrade runs, etc. But then we'd run into
people with weird workloads who managed to make it some massive file, we'd
have to make sure we had a way to stop it getting left out of
copies/backups, and it'd generally be irritating. So let's not do that.
Proper support for class-based logging and multiple outputs would be a good
solution for this at some future point.

What you propose is simple enough to be quick to implement, adds no admin
overhead, and will be plenty useful.

I'd like to add "postmaster.pid was absent when the cluster started" to
this list, please. Sure, it's not conclusive, and there are legit reasons
why that might be the case, but so often it's somebody kill -9'ing the
postmaster, then removing the postmaster.pid and starting up again without
killing the workers....
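
For illustration only, here is a rough sketch of how such a bitmask might
look, covering the four bits proposed above plus the postmaster.pid one. All
names (TAINT_*, TaintState, taint_set, taint_is_set, taint_describe) are
hypothetical; none of this exists in PostgreSQL, and a real patch would put
the field in the pg_control struct rather than in a stand-in one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits; the names and bit positions are assumptions. */
#define TAINT_RESETXLOG_USED        (UINT32_C(1) << 0)	/* pg_resetxlog used non-passively */
#define TAINT_FSYNC_OFF             (UINT32_C(1) << 1)	/* ran with fsync=off */
#define TAINT_FULL_PAGE_WRITES_OFF  (UINT32_C(1) << 2)	/* ran with full_page_writes=off */
#define TAINT_PG_UPGRADE_USED       (UINT32_C(1) << 3)	/* pg_upgrade was used */
#define TAINT_PIDFILE_ABSENT        (UINT32_C(1) << 4)	/* postmaster.pid absent at start */

/* Stand-in for the part of pg_control that would carry the field. */
typedef struct TaintState
{
	uint32_t	taint;
} TaintState;

static const struct
{
	uint32_t	flag;
	const char *name;
} taint_names[] = {
	{TAINT_RESETXLOG_USED, "resetxlog_used"},
	{TAINT_FSYNC_OFF, "fsync_off"},
	{TAINT_FULL_PAGE_WRITES_OFF, "full_page_writes_off"},
	{TAINT_PG_UPGRADE_USED, "pg_upgrade_used"},
	{TAINT_PIDFILE_ABSENT, "pidfile_absent"},
};

/* Bits are sticky: they are only ever set, never cleared. */
static void
taint_set(TaintState *s, uint32_t flag)
{
	s->taint |= flag;
}

static bool
taint_is_set(const TaintState *s, uint32_t flag)
{
	return (s->taint & flag) != 0;
}

/*
 * Write a comma-separated list of the set flags into buf and return how
 * many names were written in full. Stops early if buf runs out of space,
 * in which case buf may end with a truncated name.
 */
static int
taint_describe(uint32_t taint, char *buf, size_t buflen)
{
	int			count = 0;
	size_t		used = 0;

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(taint_names) / sizeof(taint_names[0]); i++)
	{
		int			n;

		if ((taint & taint_names[i].flag) == 0)
			continue;

		n = snprintf(buf + used, buflen - used, "%s%s",
					 count > 0 ? "," : "", taint_names[i].name);
		if (n < 0 || (size_t) n >= buflen - used)
			break;
		used += (size_t) n;
		count++;
	}
	return count;
}
```

Something like taint_describe() is what a tool in the style of
pg_controldata could use to print the field in readable form, though how it
would actually be displayed is an open question.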

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
