Re: Should we make scary sounding, but actually routine, errors less scary?

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Should we make scary sounding, but actually routine, errors less scary?
Date: 2019-11-06 13:21:59
Message-ID: 20191106132159.GS6962@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Andres Freund (andres(at)anarazel(dot)de) wrote:
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:

Agreed.

> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:

Eh, +1 or +2.

> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.

That'd be nice, but I anticipate the question coming up of "how much is
too much to do before an immediate shutdown?"

> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

Would be nice if we could improve that, agreed.

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:

+1000

> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.

Yes, agreed, this happens all the time and would be good to improve.

> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we could should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

Something like that does look like an improvement.

> 3) When a standby node is shutdown in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
>
>
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Yeah... This sounds like it would be more difficult to tackle, though I
agree it'd be nice to improve on this too.

Thanks,

Stephen

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message 曾文旌 (义从) 2019-11-06 13:24:36 Re: [Proposal] Global temporary tables
Previous Message Peter Eisentraut 2019-11-06 12:27:05 Re: Collation versioning