From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: andres(at)anarazel(dot)de
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Should we make scary sounding, but actually routine, errors less scary?
Date: 2019-11-06 08:36:09
Message-ID: 20191106.173609.355066219628074584.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 5 Nov 2019 15:54:22 -0800, Andres Freund <andres(at)anarazel(dot)de> wrote in
> Hi,
>
> There's a few errors that we issue that are, often, much less bad than
> they sound. The most common cases I immediately can recall are:
>
>
> 1) Mentioning crash, once for each backend, when shutting down
> immediately. Currently the log output for that, with just two sessions
> connected, is the following:
>
> 2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
> 2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
> 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23199][4/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23199][4/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [23187][3/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [23187][3/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [23187][3/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [23187][3/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.634 PST [9345][1/0] WARNING: 57P02: terminating connection because of crash of another server process
> 2019-11-05 15:09:52.634 PST [9345][1/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2019-11-05 15:09:52.634 PST [9345][1/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.
> 2019-11-05 15:09:52.634 PST [9345][1/0] LOCATION: quickdie, postgres.c:2734
> 2019-11-05 15:09:52.644 PST [9340][] LOG: 00000: database system is shut down
> 2019-11-05 15:09:52.644 PST [9340][] LOCATION: UnlinkLockFiles, miscinit.c:859
>
> (23199, 23187 are backends, 9345 is the autovacuum launcher)
>
> I think there's multiple things wrong with this. For one, reading that
> the server has crashed (no "might" in there) is scary when all that's
> happening is an intentional shutdown. But the sheer log volume is also
> bad - on a busy server there may be a *lot* of these lines.
>
> I think the log volume is bad even for an actual PANIC. I've spent *way*
> too much time scrolling through pages and pages of the above lines, just
> to find the one or two lines indicating an actual error.
>
> It seems like we ought to be able to somehow
>
> a) Signal that the server has been shut down in immediate mode, rather
> than actually crashed, and issue a different log message to the user.
>
> b) Stop issuing, to the server log, the same message over and over. We
> instead just ought to send the message to the client. We however need to
> be careful that we don't make it harder to debug a SIGQUIT sent directly
> to backend processes.

I doubt that issuing different messages to the server log and the
client is worth the trouble. Isn't it enough to move the cause
description from the backend message to the postmaster's message?

In addition, I don't see such a message on an already-connected psql;
it just reports "server closed the connection unexpectedly" without
any server message. Unless I'm missing something, the HINT message is
useless.

2019-11-05 15:09:52.634 PST [9340][] LOG: 00000: received immediate shutdown request
2019-11-05 15:09:52.634 PST [9340][] LOCATION: pmdie, postmaster.c:2883
+ 2019-11-05 15:09:52.634 PST [9340][] LOG: terminating all active server processes
- 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection because of crash of another server process
+ 2019-11-05 15:09:52.634 PST [23199][4/0] WARNING: 57P02: terminating connection due to command by postmaster
- 2019-11-05 15:09:52.634 PST [23199][4/0] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
- 2019-11-05 15:09:52.634 PST [23199][4/0] HINT: In a moment you should be able to reconnect to the database and repeat your command.

If some process has crashed, it looks like:

2019-11-06 10:47:11.387 JST [19737] LOG: 00000: server process (PID 19774) was terminated by signal 11: Segmentation fault
2019-11-06 10:47:11.387 JST [19737] DETAIL: Failed process was running: select pg_sleep(1200);
2019-11-06 10:47:11.387 JST [19737] LOCATION: LogChildExit, postmaster.c:3749
2019-11-06 10:47:11.387 JST [19737] LOG: 00000: terminating any other active server processes, because a server process exited abnormally and possibly corrupted shared memory.
2019-11-06 10:47:11.387 JST [19737] LOCATION: HandleChildCrash, postmaster.c:3469
2019-11-06 10:47:11.387 JST [19800] WARNING: 57P02: terminating connection due to command by postmaster
2019-11-06 10:47:11.387 JST [19800] LOCATION: quickdie, postgres.c:2736
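
In code, what I have in mind is something like the following. This is
only a sketch, not a patch: QuitReason, SetQuitReason() and
GetQuitReason() are invented names, not existing APIs. The postmaster
would record in shared memory why it is about to SIGQUIT its children,
and quickdie() would choose its message accordingly.

/*
 * Hypothetical sketch: the postmaster records the reason for SIGQUIT
 * before sending it; quickdie() consults it instead of always
 * assuming a crash.
 */
typedef enum QuitReason
{
    QUIT_FOR_CRASH,             /* another server process crashed */
    QUIT_FOR_IMMEDIATE_STOP     /* immediate shutdown was requested */
} QuitReason;

/* called from quickdie() in place of the one-size-fits-all WARNING */
static void
report_termination_reason(void)
{
    if (GetQuitReason() == QUIT_FOR_IMMEDIATE_STOP)
        ereport(WARNING,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("terminating connection due to immediate shutdown command")));
    else
        ereport(WARNING,
                (errcode(ERRCODE_CRASH_SHUTDOWN),
                 errmsg("terminating connection because of crash of another server process"),
                 errdetail("The postmaster has commanded this server process to roll back"
                           " the current transaction and exit, because another server"
                           " process exited abnormally and possibly corrupted shared memory."),
                 errhint("In a moment you should be able to reconnect to the database"
                         " and repeat your command.")));
}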

> 2) At the end of crash recovery, and often when the startup processes
> switches between WAL sources, we get scary messages like:
>
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOG: 00000: invalid record length at F/48C0A500: wanted 24, got 0
> 2019-11-05 15:20:21.907 PST [23407][] <> <> LOCATION: ReadRecord, xlog.c:4282
> or
> 2019-11-05 15:35:03.321 PST [28518][1/0] LOG: invalid resource manager ID 52 at 3/7CD0B1B8
>
> or any of the other types of xlogreader errors.
>
> I've seen countless tickets being raised because PG users looked at
> their logs and got scared. We want them to look at the logs, so this
> seems counter-productive.
>
> It seems we, at the very least, could add an error context or something
> explaining that a LOG message about the log end is to be expected. Or
> perhaps we should reformulate the message to something like
> 'ELEVEL: reached end of valid WAL at XX/XX'
> 'DETAIL: end determined due to invalid record length: wanted 24, got 0'
>
> perhaps with a HINT in the elevel < ERROR case indicating that this is
> not a failure.

The proposed message seems far less scary. +1.
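
For example, in ReadRecord() it could look like the following rough
sketch. emode_for_corrupt_record() already exists and decides the
elevel; the message wording is just the proposal above, and the HINT
is attached only in the elevel < ERROR case as suggested.

if (record == NULL && errormsg != NULL)
{
    int         elevel = emode_for_corrupt_record(emode, EndRecPtr);

    ereport(elevel,
            (errmsg("reached end of valid WAL at %X/%X",
                    (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr),
             errdetail("End determined due to: %s", errormsg),
             (elevel < ERROR) ?
             errhint("This is expected at the end of WAL and does not"
                     " indicate a failure.") : 0));
}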

> 3) When a standby node is shutdown in immediate mode, we issue:
>
> 2019-11-05 15:45:58.722 PST [30321][] LOG: database system was interrupted while in recovery at log time 2019-11-05 15:37:43 PST
> 2019-11-05 15:45:58.722 PST [30321][] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
>
> Which imo doesn't really make sense for at least standbys, which are
> expected to be in recovery forever. There's more than enough scenarios
> where one would shut down a standby in immediate mode for good reasons,
> we shouldn't issue such warnings in that case.
>
>
> The tricky part in improving this would be how to detect when the
> message should still be issued for a standby. One idea, which is not
> bullet proof but might be good enough, would be to record in the control
> file which position recovery was started from last time, and only issue
> the error when recovery would start from the same point.

Recovery always starts from the latest REDO point, so maybe
MinRecoveryPoint is better suited for that purpose.
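
For example (a hypothetical sketch in StartupXLOG();
"lastRecoveryStartPoint" is an invented pg_control field, while
minRecoveryPoint is real): issue the warning only when recovery is
about to start from the same point as the previous attempt, i.e. when
no progress was made in between.

    bool        repeated_attempt;

    repeated_attempt = (ControlFile->minRecoveryPoint != InvalidXLogRecPtr &&
                        ControlFile->minRecoveryPoint == ControlFile->lastRecoveryStartPoint);

    if (repeated_attempt)
        ereport(LOG,
                (errmsg("database system was interrupted while in recovery at log time %s",
                        str_time(ControlFile->checkPointCopy.time)),
                 errhint("If this has occurred more than once some data might be corrupted"
                         " and you might need to choose an earlier recovery target.")));

    /* remember where this attempt starts, for the next restart */
    ControlFile->lastRecoveryStartPoint = ControlFile->minRecoveryPoint;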

> I'm sure there are more types of messages in this category, these are
> just the ones I could immediately recall from memory as having scared
> actual users unnecessarily.

I often see inquiries about "FATAL: the database system is starting
up". It is indeed FATAL for the backend internally, but it is overly
scary for users.

> I don't plan on fixing these immediately myself, even if we were to
> agree on something, so if anybody is interested in helping...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
