Re: Make mesage at end-of-recovery less scary.

From: James Coleman <jtc331(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Make mesage at end-of-recovery less scary.
Date: 2020-03-28 02:25:29
Message-ID: CAAaqYe88ENQp=ksG4c1J-Mi1axM5dtO+48x8tC3afE-_Z_qFSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
> <peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> > HINT: This is to be expected if this is the end of the WAL. Otherwise,
> > it could indicate corruption.
>
> First, I agree that this general issue is a problem, because it's come
> up for me in quite a number of customer situations. Either people get
> scared when they shouldn't, because the message is innocuous, or they
> don't get scared about other things that actually are scary, because
> if some scary-looking messages are actually innocuous, it can lead
> people to believe that the same is true in other cases.
>
> Second, I don't really like the particular formulation you have above,
> because the user still doesn't know whether or not to be scared. Can
> we figure that out? I think if we're in crash recovery, I think that
> we should not be scared, because we have no alternative to assuming
> that we've reached the end of WAL, so all crash recoveries will end
> like this. If we're in archive recovery, we should definitely be
> scared if we haven't yet reached the minimum recovery point, because
> more WAL than that should certainly exist. After that, it depends on
> how we got the WAL. If it's being streamed, the question is whether
> we've reached the end of what got streamed. If it's being copied from
> the archive, we ought to have the whole segment, but maybe not more.
> Can we get the right context to the point where the error is being
> reported to know whether we hit the error at the end of the WAL that
> was streamed? If not, can we somehow rejigger things so that we only
> make it sound scary if we keep getting stuck at the same point when we
> woud've expected to make progress meanwhile?
>
> I'm just spitballing here, but it would be really good if there's a
> way to know definitely whether or not you should be scared. Corrupted
> WAL segments are definitely a thing that happens, but retries are a
> lot more common.

First, I agree that getting enough context to say precisely is by far the ideal.

That being said, as an end user who's found this surprising -- and
momentarily scary every time I initially scan it even though I *know
intellectually it's not* -- I would find Peter's suggestion a
significant improvement over what we have now. I'm fairly certainly my
co-workers on our database team would also. Knowing that something is
at least not always scary is good. Though I'll grant that this does
have the negative in reverse: if it actually is a scary
situation...this mutes your concern level. On the other hand,
monitoring would tell us if there's a real problem (namely replication
lag), so I think the trade-off is clearly worth it.

How about this minor tweak:
HINT: This is expected if this is the end of currently available WAL.
Otherwise, it could indicate corruption.

Thanks,
James

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2020-03-28 02:58:30 Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Previous Message Alvaro Herrera 2020-03-28 01:52:26 Re: pgbench - refactor init functions with buffers