Re: archive status ".ready" files may be created too early

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: hlinnaka(at)iki(dot)fi
Cc: matsumura(dot)ryo(at)fujitsu(dot)com, bossartn(at)amazon(dot)com, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: archive status ".ready" files may be created too early
Date: 2020-10-14 00:06:28
Message-ID: 20201014.090628.839639906081252194.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thanks for visiting this thread.

At Mon, 12 Oct 2020 15:04:40 +0300, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote in
> On 07/07/2020 12:02, matsumura(dot)ryo(at)fujitsu(dot)com wrote:
> > At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> >>>> after WAL buffer is filled up to the requested position. So when it
> >>>> crosses a segment boundary we know that all past cross-segment-boundary
> >>>> records are stable. That means all we need to remember is only the
> >>>> position of the latest cross-boundary record.
> >>>
> >>> I could not agree. In the following case, it may not work well.
> >>> - record-A and record-B (record-B is the newer one) are copied, and
> >>> - lastSegContRecStart/End points to record-B's, and
> >>> - FlushPtr has advanced to the middle of record-A.
> >>
> >> IIUC, that means record-B is a cross segment-border record and we have
> >> flushed beyond record-B. In that case crash recovery afterwards
> >> can read the complete record-B and will finish recovery *after* the
> >> record-B. That's what we need here.
> > I'm sorry I didn't explain enough.
> > Record-A and Record-B are cross segment-border records.
> > Record-A spans segment X and X+1
> > Record-B spans segment X+2 and X+3.
> > If both records have been inserted to WAL buffer,
> > lastSegContRecStart/End points to Record-B.
> > If a writer flushes up to the middle of segment-X+1,
> > NotifyStableSegments() allows the writer to notify segment-X.
> > Is my understanding correct?
>
> I think this little ASCII drawing illustrates the above scenario:
>
> AAAAA F BBBBB
> |---------|---------|---------|
> seg X seg X+1 seg X+2
>
> AAAAA and BBBBB are Record-A and Record-B. F is the current flush
> pointer.

I modified the figure a bit for the explanation below.

F0 F1
AAAAA F BBBBB
|---------|---------|---------|
seg X seg X+1 seg X+2

Matsumura-san has a concern about the case where there are two (or
more) partially-flushed segment-spanning records at the same time.

This patch remembers only the last cross-segment record. If we were
going to flush up to F0 after Record-B had been written, we would fail
to hold off archiving seg-X. This patch is based on the assumption that
this case cannot happen, because we don't leave a pending page at the
time of a segment switch and no record spans three or more
segments.
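
To make the concern concrete, here is a toy model in Python. It is not
the patch's code: the names (CrossSegRecord, last_notifiable_segment),
the 16MB segment size and the LSN values are all assumptions made up
for illustration. It only encodes the rule discussed here, that the
segment where a cross-segment record starts may be notified only once
the flush pointer has reached the end of that record, while remembering
a single such record.

```python
from dataclasses import dataclass
from typing import Optional

SEG_SIZE = 16 * 1024 * 1024  # assumed segment size (the default 16MB)


@dataclass
class CrossSegRecord:
    start: int  # LSN where the record begins
    end: int    # LSN just past the last byte of the record


def seg_of(lsn: int) -> int:
    """Segment number that contains the given LSN."""
    return lsn // SEG_SIZE


def last_notifiable_segment(flush: int,
                            last_cross: Optional[CrossSegRecord]) -> int:
    """Highest segment that may get a .ready file, or -1 if none.

    Only one cross-segment record is remembered, as in the discussion.
    """
    # Every segment that lies entirely below the flush pointer is complete.
    candidate = seg_of(flush) - 1
    if last_cross is not None and last_cross.start <= flush < last_cross.end:
        # The flush pointer is inside the remembered record, so hold off
        # the segment in which that record starts.
        candidate = min(candidate, seg_of(last_cross.start) - 1)
    return candidate


# Record-A spans segments 0 and 1, Record-B spans segments 2 and 3.
A = CrossSegRecord(start=SEG_SIZE - 100, end=SEG_SIZE + 100)
B = CrossSegRecord(start=3 * SEG_SIZE - 100, end=3 * SEG_SIZE + 100)
F0 = SEG_SIZE + 50  # a flush position in the middle of Record-A

held_off = last_notifiable_segment(F0, A)     # A remembered: nothing notified
too_early = last_notifiable_segment(F0, B)    # A forgotten: segment 0 notified
after_a = last_notifiable_segment(A.end, A)   # A fully flushed: segment 0 is fine
```

With only Record-B remembered, a flush to F0 lets segment 0 through
even though Record-A is still incomplete on disk. That is the case the
assumption above is supposed to rule out.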

> In this case, it would be OK to notify segment X, as long as F is
> greater than the end of record A. And if I'm reading Kyotaro's patch
> correctly, that's what would happen with the patch.
>
> The patch seems correct to me. I'm a bit sad that we have to track yet
> another WAL position (two, actually) to fix this, but I don't see a
> better way.

Does "two" mean Record-A and Record-B? Is that needed even given the
assumption above?

> I wonder if we should arrange things so that XLogwrtResult.Flush never
> points in the middle of a record? I'm not totally convinced that all

That happens at a good percentage of page boundaries, and a record can
span three or more pages. Do we need to avoid all such cases?

I did that only for the cross-segment case.
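
As a rough illustration of how easily a page boundary falls inside a
record, the following Python sketch lists the page boundaries that lie
strictly inside a record. The function name and the example numbers are
hypothetical, the 8kB page size is the default XLOG_BLCKSZ, and page
headers are ignored for simplicity, so the real span of a record is
slightly longer than computed here.

```python
XLOG_BLCKSZ = 8192  # assumed WAL page size (the default)


def page_boundaries_inside(start: int, length: int) -> list[int]:
    """Page-boundary LSNs strictly inside the record [start, start + length).

    Page headers are ignored, so this is only an approximation.
    """
    end = start + length
    # First page boundary strictly after the start of the record.
    first = (start // XLOG_BLCKSZ + 1) * XLOG_BLCKSZ
    return list(range(first, end, XLOG_BLCKSZ))


# A 20000-byte record that starts 192 bytes before the end of a page.
boundaries = page_boundaries_inside(8000, 20000)
pages_touched = len(boundaries) + 1
```

A flush that stops at any of the returned boundaries leaves the flush
pointer in the middle of that record.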

> the current callers of GetFlushRecPtr() are OK with a middle-of-WAL
> record value. Could we get into similar trouble if a standby
> replicates half of a cross-segment record to a cascaded standby, and
> the cascaded standby has WAL archiving enabled?

The patch includes a fix for the primary->standby case. But I'm not sure
we can do that in the cascaded case. A standby is not aware of the
structure of a WAL blob and has no idea up to where it should send the
received blobs. However, if we can rely on the behavior of CopyData
that we always receive a blob as a whole, as sent from the sender at once,
the cascaded standbys are free from the trouble (as long as the
cascaded standby doesn't crash just before writing the latter half of a
record into pg_wal and after archiving the last full segment, which
seems unlikely).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v2-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch text/x-patch 9.2 KB
