Re: Fix calculations on WAL recycling.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(at)paquier(dot)xyz
Cc: pgsql-hackers(at)postgresql(dot)org, hlinnaka(at)iki(dot)fi
Subject: Re: Fix calculations on WAL recycling.
Date: 2018-07-23 10:55:57
Message-ID: 20180723.195557.114019130.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Mon, 23 Jul 2018 15:59:16 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20180723065916(dot)GI2854(at)paquier(dot)xyz>
> On Mon, Jul 23, 2018 at 01:57:48PM +0900, Kyotaro HORIGUCHI wrote:
> > I'll register this to the next commitfest.
> >
> > https://www.postgresql.org/message-id/20180719.125117.155470938.horiguchi.kyotaro@lab.ntt.co.jp
>
> This is an open item for v11.

Mmm. Thanks. I wrongly thought this was v10 item. Removed this
from the next CF.

Thank you for taking a look.

> >> While considering this, I found a bug in 4b0d28de06, which
> >> removed prior checkpoint from control file. It actually trims the
> >> segments before the last checkpoint's redo segment but recycling
> >> is still considered based on the *prevous* checkpoint. As the
> >> result min_wal_size doesn't work as told. Specifically, setting
> >> min/max_wal_size to 48MB and advance four or more segments then
> >> two checkpoints leaves just one segment, which is less than
> >> min_wal_size.
> >>
> >> The attached patch fixes that. One arguable point on this would
> >> be the removal of the behavior when RemoveXLogFile(name,
> >> InvalidXLogRecPtr, ..).
> >>
> >> The only place calling the function with the parameter is
> >> timeline switching. Previously unconditionally 10 segments are
> >> recycled after switchpoint but the reason for the behavior is we
> >> didn't have the information on previous checkpoint at hand at the
> >> time. But now we can use the timeline switch point as the
> >> approximate of the last checkpoint's redo point and this allows
> >> us to use min/max_wal_size properly at the time.
>
> I have been looking at that, and tested with this simple scenario:
> create table aa (a int);
>
> Then just repeat the following SQLs:
> insert into aa values (1);
> select pg_switch_wal();
> checkpoint;
> select pg_walfile_name(pg_current_wal_lsn());
>
> By doing so, you would notice that the oldest WAL segment does not get
> recycled after the checkpoint, while it should as the redo pointer is
> always checkpoint generated. What happens is that this oldest segment
> gets recycled every two checkpoints.

(I'm not sure I'm getting it correctly..) I see the old segments
recycled. When I see pg_switch_wal() returns 0/12..,
pg_walfile_name is ...13 and segment files for 13 and 14 are
found in pg_wal directory. That is, seg 12 was recycled as seg
14. log_checkpoint=on shows every checkpoint recycles 1 segment
in the case.

> Then I had a look at the proposed patch, with a couple of comments.
>
> - if (PriorRedoPtr == InvalidXLogRecPtr)
> - recycleSegNo = endlogSegNo + 10;
> - else
> - recycleSegNo = XLOGfileslop(PriorRedoPtr);
> + recycleSegNo = XLOGfileslop(RedoRecPtr);
> I think that this is a new behavior, and should not be changed, see
> point 3 below.
>
> In CreateCheckPoint(), the removal of past WAL segments is always going
> to happen as RedoRecPtr is never InvalidXLogRecPtr, so the recycling
> should happen also when PriorRedoPtr is InvalidXLogRecPtr, no?

Yes. The reason for the change was the change of
RemoveNonParentXlogFiles that I made and it is irrelevant to this
bug fix (and rather breaking as you pointed..). So, reverted it.

> /* Trim from the last checkpoint, not the last - 1 */
> This comment could be adjusted, let's remove "not the last - 1" .

Oops! Thanks. The comment has finally vanished and melded into
another comment just above.

| * Delete old log files not required by the last checkpoint and recycle
| * them

> The calculation of _logSegNo in CreateRestartPoint is visibly incorrect,
> this still uses PriorRedoPtr so the bug is not fixed for standbys. The
> tweaks for ThisTimeLineID also need to be out of the loop where
> PriorRedoPtr is InvalidXLogRecPtr, and only the distance calculation
> should be kept.

Agreed. While I reconsidered this, I noticed that the estimated
checkpoint distance is 0 when PriorRedoPtr is invalid. This means
that the first checkpoint/restartpoint tries to reduce WAL
segments to min_wal_size. However, it happens only at initdb time
and makes no substantial behavior change so I made the change
ignoring the difference.

> Finally, in summary, this patch is doing actually three things:
> 1) Rename a couple of variables which refer incorrectly to the prior
> checkpoint so as they refer to the last checkpoint generated.
> 2) Make sure that WAL recycling/removal happens based on the last
> checkpoint generated, which is the one just generated when past WAL
> segments are cleaned up as post-operation actions.
> 3) Enforce the recycling point to be the switch point instead of
> arbitrarily recycle 10 segments, which is what b2a5545b has introduced.
> Recycling at exactly the switch point is wrong, as the redo LSN of the
> previous checkpoint may not be at the same segment when a timeline has
> switched, so you would finish with recycling segments which are still
> needed if an instance needs be crash-recovered post-promotion without
> a completed post-recovery checkpoint. In short this is dangerous.
> I'll let Heikki comment on that, but I think that you should fetch the
> last redo LSN instead, still you need to be worried about checkpoints
> running in parallel of the startup process, so I would stick with the
> current logic.

Thank you for the detail. I was coufused a bit there. I agree to
preserve the logic, too.

> 1) and 2) are in my opinion clearly oversights from 4b0d28d, but 3) is
> not.

Thank you for the comments and suggestions. After all, I did the
following things in the attached patch.

- Reverted the change on timeline switching. (Removed the (3))

- Fixed CreateRestartPoint to do the same thing with CreateCheckPoint.

- Both CreateRestart/CheckPoint now tries trimming of WAL
segments even for the first time.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v3-0001-Fix-calculation-base-of-WAL-recycling.patch text/x-patch 10.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2018-07-23 10:58:11 Re: [Proposal] Add accumulated statistics for wait event
Previous Message Darafei Komяpa Praliaskouski 2018-07-23 10:42:02 Re: JIT breaks PostGIS