Re: a small proposal for avoiding foot-shooting

From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: a small proposal for avoiding foot-shooting
Date: 2008-12-21 08:22:28
Message-ID: 87wsdux7i3.fsf@news-spur.riddles.org.uk
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> I propose that this behaviour be changed such that 'terse' is
>> ignored for all log messages of FATAL or PANIC severity.
>> [ on the strength of a single example ]

Tom> This seems like using a blunderbuss where a rifle is called for.

Maybe so.

Tom> There may indeed be some places where we have HINTS that are
Tom> conveying pretty important information, but I see no argument
Tom> whatsoever that the importance of a hint is determined by the
Tom> severity level of the message it's attached to.

With a small number of exceptions, FATAL and PANIC messages are of
the form "the database won't start (due to X)" or "the database just
died (due to X)". A relatively small proportion of them have errhint
or errdetail records, but those that do have detail records also tend
to have extremely unhelpful errmsg text.

In fact (at least in 8.3) there is only one PANIC message with an
errhint, and only one with an errdetail, and both of those ought to be
in the "must print to avoid confusing the DBA" category. The FATAL
messages are more of a mixed bag.

Tom> I could see inventing some kind of additional ereport decoration
Tom> that says "force the hint to be printed", but realize that this
Tom> is only likely to have any effect in the postmaster log --- we
Tom> can't guarantee to control what clients do with subsidiary
Tom> message fields. So the value seems a bit limited anyway.

For PANIC messages especially, the postmaster log is really what
counts.
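
To make the two options concrete, here is a minimal standalone sketch of the decision being argued about. It is not the code in elog.c: the enums, the Report struct, the force_hint flag, and both function names are invented for illustration, and the hint text is only the paraphrase used in this thread. It models "terse drops the hint unless the severity is FATAL or PANIC" alongside the per-message "force the hint" decoration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the definitions from elog.h. */
typedef enum { SEV_LOG, SEV_ERROR, SEV_FATAL, SEV_PANIC } Severity;
typedef enum { VERB_TERSE, VERB_DEFAULT } Verbosity;

typedef struct {
    Severity    sev;
    const char *message;
    const char *hint;        /* may be NULL */
    bool        force_hint;  /* assumed per-message "force the hint" decoration */
} Report;

/* Decide whether the hint is written to the server log. */
bool
hint_is_logged(const Report *r, Verbosity v)
{
    assert(r != NULL);
    if (r->hint == NULL)
        return false;
    if (v != VERB_TERSE)
        return true;
    /* terse: keep the hint for FATAL/PANIC (the proposal) or when forced */
    return r->sev >= SEV_FATAL || r->force_hint;
}

/* Format the report into buf; returns the snprintf-style length. */
int
format_report(const Report *r, Verbosity v, char *buf, size_t buflen)
{
    static const char *const names[] = {"LOG", "ERROR", "FATAL", "PANIC"};
    int n = snprintf(buf, buflen, "%s:  %s\n", names[r->sev], r->message);

    if (n < 0 || (size_t) n >= buflen)
        return n;               /* error or truncated; skip the hint */
    if (hint_is_logged(r, v))
        n += snprintf(buf + n, buflen - (size_t) n, "HINT:  %s\n", r->hint);
    return n;
}
```

With this rule, a PANIC report carrying a hint keeps it under terse, while an ERROR report loses it unless force_hint is set. Either mechanism only governs what lands in the postmaster log.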

Tom> It seems like it might be better to rephrase error messages to
Tom> ensure that anything really critical is mentioned in the primary
Tom> message.

Tom> In this case, perhaps instead of
Tom> errmsg("could not locate required checkpoint record")
Tom> we could have it print
Tom> errmsg("could not locate checkpoint record specified in file
Tom> \"%s/backup_label\".", DataDir)
Tom> assuming we did actually get the location from there.

That's still not capturing the important part of the HINT message in
this specific case, which is "you must remove the backup_label file
now if you're not trying to restore from a backup".

(The current behaviour where recovery CANNOT succeed without manual
intervention if the database went down while pg_start_backup is in
effect is of course entirely suboptimal. Lack of clear direction in
the error message as to what to do in that circumstance is pretty
much unforgivable.)

Tom> Anyway, you've omitted a lot of details that would be necessary
Tom> to judge exactly what was misleading about what the DBA saw.

This is exactly what the DBA saw (following a pg_ctl restart -mimmediate):

----
2008-12-20 10:26:57 EST FATAL: the database system is starting up
2008-12-20 10:26:57 EST LOG: database system was interrupted; last known up at 2008-12-20 10:24:00 EST
2008-12-20 10:26:57 EST FATAL: the database system is starting up
2008-12-20 10:26:57 EST FATAL: the database system is starting up
2008-12-20 10:26:57 EST LOG: could not open file "pg_xlog/00000001000001E100000087" (log file 481, segment 135): No such file or directory
2008-12-20 10:26:57 EST LOG: invalid checkpoint record
2008-12-20 10:26:57 EST PANIC: could not locate required checkpoint record
2008-12-20 10:26:57 EST LOG: startup process (PID 1634) was terminated by signal 6: Aborted
2008-12-20 10:26:57 EST LOG: aborting startup due to startup process failure
----

(Earliest xlog file actually present at that time was
00000001000001E20000004A.)
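
For anyone not used to reading these file names: the log line above pairs "00000001000001E100000087" with "log file 481, segment 135", which matches a layout of three 8-digit hex fields (timeline, log file, segment). The sketch below decodes both names under that assumption and compares them; the struct and function names are made up for illustration and are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative type; not a PostgreSQL structure. */
typedef struct {
    unsigned int tli;  /* timeline ID */
    unsigned int log;  /* "log file" number */
    unsigned int seg;  /* segment number within that log file */
} SegName;

/* Split a 24-character segment file name into three 8-digit hex fields. */
bool
parse_seg_name(const char *name, SegName *out)
{
    assert(name != NULL && out != NULL);
    if (strlen(name) != 24)
        return false;
    if (strspn(name, "0123456789ABCDEFabcdef") != 24)
        return false;
    return sscanf(name, "%8x%8x%8x", &out->tli, &out->log, &out->seg) == 3;
}

/* True if a comes before b in the WAL stream (same timeline assumed). */
bool
seg_precedes(const SegName *a, const SegName *b)
{
    return a->log < b->log || (a->log == b->log && a->seg < b->seg);
}
```

Decoded this way, the file the startup process wanted is log file 481 (0x1E1), segment 135 (0x87), and the earliest file present is log file 482 (0x1E2), segment 74 (0x4A). So the required segment is older than anything still in pg_xlog.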

--
Andrew.
