Re: prevent immature WAL streaming

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: masao(dot)fujii(at)oss(dot)nttdata(dot)com
Cc: andres(at)anarazel(dot)de, alvherre(at)alvh(dot)no-ip(dot)org, pgsql-hackers(at)lists(dot)postgresql(dot)org, bossartn(at)amazon(dot)com, mengjuan(dot)cmj(at)alibaba-inc(dot)com, Jakub(dot)Wartak(at)tomtom(dot)com
Subject: Re: prevent immature WAL streaming
Date: 2021-09-02 00:24:15
Message-ID: 20210902.092415.728659185237934198.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>
>
> On 2021/09/01 12:15, Andres Freund wrote:
> > Hi,
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking
> >>> too
> >>> much. How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set
> >>> XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the
> >>> WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> >> when recovery finds a a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived. No?
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
> > "tail
> > end" of a partial record. I.e. if there's a partial record starting in
> > the
> > successfully archived segment A, but the end of the record, in B, has
> > not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL
> > at the
> > end of the valid data in B. Which could not have been archived yet, or
> > we'd
> > not have a partial record. So we should never need to set the flag on
> > an
> > already archived page.
>
> Thanks for clarifying that! Unless I misunderstand that, when recovery
> ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting
> writing
> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
> a partially-flushed segment-spanning record. If none of them is set,
> the validation code in recovery should report an error.
>
> Yes, this design looks good to me.

So, this is a crude PoC of that.

At the end of recovery:

- When XLogReadRecord misses a page where the next part of the current
continuation record should be seen, xlogreader->ContRecAbortPtr is
set to the beginning of the missing page.

- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
as the next WAL insertion location then sets the same to
XLogCtl->contAbortedRecPtr.

- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer()
(called under XLogInsertRecord()) sets XLP_FIRST_IS_ABORTED_PARTIAL
flag to the page.

While recovery:
- When XLogReadRecord meets a XLP_FIRST_IS_ABORT_PARTIAL page, it
rereads a record from there.

In this PoC,

1. This patch is written on the current master, but it doesn't
interfare with the seg-boundary-memorize patch since it removes the
calls to RegisterSegmentBoundary.

2. Since xlogreader cannot emit a log-message immediately, we don't
have a means to leave a log message to inform recovery met an
aborted partial continuation record. (In this PoC, it is done by
fprintf:p)

3. Myebe we need to pg_waldump to show partial continuation records,
but I'm not sure how to realize that.

4. By the way, is this (method) applicable in this stage?

The attached first is the PoC including debug-crash aid. The second
is the repro script. It failes to reproduce the situation once per
several trials.

The following log messages are shown by a run of the script.

> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG: database system is shut down
> ...
>
> LOG: redo starts at 0/2000028
> LOG: redo done at 0/6FFFFA8 system usage:...
> LOG: #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8)

The record from 6FFFFE8 is missing the trailing part after 7000000.

> LOG: #### EndOfLog=0/7000000
> LOG: #### set XLP_FIRST_IS_ABORT_PARTIAL(at)0/7000000

So, WAL insertion starts from 7000000 and the first page is set the flag.

> LOG: database system is ready to accept connections
> ...
> LOG: database system is shut down
> ...
> #########################
> ...
> LOG: redo starts at 0/2000028
> LOG: consistent recovery state reached at 0/2000100
> ...
> LOG: restored log file "000000010000000000000007" from archive
> #### aborted partial continuation record found at 0/6FFFFE8, continue from 0/7000000

The record from 6FFFFE8 is immature so skip it and continue reading
from 7000000.

> LOG: last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG: #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000)
> LOG: restored log file "000000010000000000000007" from archive
> LOG: selected new timeline ID: 2
> LOG: archive recovery complete
> LOG: #### EndOfLog=0/8000000

Recovery ends.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro Horiguchi 2021-09-02 01:16:40 Re: Possible missing segments in archiving on standby
Previous Message Jacob Champion 2021-09-02 00:09:49 Re: [PATCH] test/ssl: rework the sslfiles Makefile target