Re: Make mesage at end-of-recovery less scary.

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: david(at)pgmasters(dot)net
Cc: peter(dot)eisentraut(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, pgsql-hackers(at)lists(dot)postgresql(dot)org, jtc331(at)gmail(dot)com, robertmhaas(at)gmail(dot)com
Subject: Re: Make mesage at end-of-recovery less scary.
Date: 2021-03-04 06:50:39
Message-ID: 20210304.155039.2020434514871976864.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Wed, 3 Mar 2021 11:14:20 -0500, David Steele <david(at)pgmasters(dot)net> wrote in
> Hi Kyotaro,
>
> On 3/27/20 10:25 PM, James Coleman wrote:
> > On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas(at)gmail(dot)com>
> > wrote:
> >>
> >> I'm just spitballing here, but it would be really good if there's a
> >> way to know definitely whether or not you should be scared. Corrupted
> >> WAL segments are definitely a thing that happens, but retries are a
> >> lot more common.
> > First, I agree that getting enough context to say precisely is by far
> > the ideal.
> > That being said, as an end user who's found this surprising -- and
> > momentarily scary every time I initially scan it even though I *know
> > intellectually it's not* -- I would find Peter's suggestion a
> > significant improvement over what we have now. I'm fairly certainly my
> > co-workers on our database team would also. Knowing that something is
> > at least not always scary is good. Though I'll grant that this does
> > have the negative in reverse: if it actually is a scary
> > situation...this mutes your concern level. On the other hand,
> > monitoring would tell us if there's a real problem (namely replication
> > lag), so I think the trade-off is clearly worth it.
> > How about this minor tweak:
> > HINT: This is expected if this is the end of currently available WAL.
> > Otherwise, it could indicate corruption.
>
> Any thoughts on the suggestions for making the messaging clearer?
>
> Also, the patch no longer applies:
> http://cfbot.cputube.org/patch_32_2490.log.

Sorry for missing the last discussions. I agree to the point about
really-scary situation.

ValidXLogRecordHeader deliberately marks End-Of-WAL only in the case
of zero-length record so that the callers can identify that case,
instead of inferring the EOW state without it. All other invalid data
is treated as potentially danger situation. I think this is a
reasonable classification. And the error level for the "danger" cases
is changed to WARNING (from LOG).

As the result, the following messages are emitted with the attached.

- found zero-length record during recovery (the DETAIL might not be needed.)

> LOG: redo starts at 0/14000118
> LOG: reached end of WAL at 0/14C5D070 on timeline 1 in pg_wal during crash recovery
> DETAIL: record length is 0 at 0/14C5D070
> LOG: redo done at 0/14C5CF48 system usage: ...

- found another kind of invalid data

> LOG: redo starts at 0/150000A0
> WARNING: invalid record length at 0/1500CA60: wanted 24, got 54
> LOG: redo done at 0/1500CA28 system usage: ...

On the way checking the patch, I found that it emits the following log
lines in the case the redo loop meets an invalid record at the
starting:

> LOG: invalid record length at 0/10000118: wanted 24, got 42
> LOG: redo is not required

which doesn't look proper. That case is identifiable using the
End-of_WAL flag this patch adds. Thus we get the following error
messages.

- found end-of-wal at the beginning of recovery

> LOG: reached end of WAL at 0/130000A0 on timeline 1 in pg_wal during crash recovery
> DETAIL: record length is 0 at 0/130000A0
> LOG: redo is not required

- found invalid data

> WARNING: invalid record length at 0/120000A0: wanted 24, got 42
> WARNING: redo is skipped
> HINT: This suggests WAL file corruption. You might need to check the database.

The logic of ErrRecPtr in ReadRecord may wrong. I remember having
such an discussion before...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v4-0001-Make-End-Of-Recovery-error-less-scary.patch text/x-patch 8.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-03-04 06:51:40 Re: Improvements and additions to COPY progress reporting
Previous Message Bharath Rupireddy 2021-03-04 06:46:17 Re: Improvements and additions to COPY progress reporting