Re: Unclear problem reports

From: Andres Freund <andres(at)anarazel(dot)de>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unclear problem reports
Date: 2022-02-05 23:20:45
Message-ID: 20220205232045.czcyitieoybzmwdb@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-02 19:35:36 -0500, Bruce Momjian wrote:
> The Postgres community is great at diagnosing problems and giving users
> feedback. In most cases, we can either diagnose a problem and give a
> fix, or at least give users a hint at finding the cause.
>
> However, there is a class of problems that are very hard to help with,
> and I have perhaps seen an increasing number of them recently, e.g.:
>
> https://www.postgresql.org/message-id/17384-f50f2eedf541e512%40postgresql.org
> https://www.postgresql.org/message-id/CALTnk7uOrMztNfzjNOZe3TdquAXDPD3vZKjWFWj%3D-Fv-gmROUQ%40mail.gmail.com
>
> I consider these as problems that need digging to find the cause, and
> users are usually unable to do sufficient digging, and we don't have
> time to give them instructions, so they never get a reply.
>
> Is there something we can do to improve this situation?

I think some of this can be improved measurably, with moderate effort, by
improving our error messages / diagnostics.

Taking e.g. the second report:

[7804] LOG: starting PostgreSQL 13.1, compiled by Visual C++ build 1914, 64-bit
[8812] LOG: invalid primary checkpoint record
[8812] PANIC: could not locate a valid checkpoint record
[7804] LOG: startup process (PID 8812) was terminated by exception 0xC0000409
[7804] HINT: See C include file "ntstatus.h" for a description of the
hexadecimal value.
[7804] LOG: aborting startup due to startup process failure
[7804] LOG: database system is shut down

We don't provide any details in this log report that allow to diagnose the
problem:
"invalid primary checkpoint record" doesn't say *why* it's invalid, nor does
it say the location at which the checkpoint record was, nor whether the
cluster was shutdown gracefully before. It could be that the permissions on
the WAL files were wrong, it could be missing files, actually invalid WAL, ...

The while following bit about "startup process ... was terminated" and it's
HINT is useless information, because we actually "knowingly" caused that
PANIC. Even disregarding that, it certainly doesn't help to list numerical
exception codes and references to ntstatus.h.

The first report is similar, although it's a bit harder to improve:
ERROR: missing chunk number 0 for toast value 152073604 in pg_toast_2619

Easy to fix: We don't mention the table that pg_toast_2619 corresponds to -
something we can determine, rather forcing the user to figure out how to do
so.

Harder to fix: We don't provide information about where the reference to the
toast value comes from. In general we don't necessarily know that, because
Datum's are handed around fairly widely. But in a lot of cases we actually
could know.

> Should we just tell them they need to hire a Postgres expert? I assume
> these are users who do not already have access to such experts.

I think these are more our fault than our users :(

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-02-05 23:34:45 Re: [BUG]Update Toast data failure in logical replication
Previous Message Noah Misch 2022-02-05 23:02:31 Re: Unclear problem reports