Re: Fix calculations on WAL recycling.

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Fix calculations on WAL recycling.
Date: 2018-07-23 06:59:16
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 23, 2018 at 01:57:48PM +0900, Kyotaro HORIGUCHI wrote:
> I'll register this to the next commitfest.

This is an open item for v11.

>> While considering this, I found a bug in 4b0d28de06, which
>> removed prior checkpoint from control file. It actually trims the
>> segments before the last checkpoint's redo segment but recycling
>> is still considered based on the *prevous* checkpoint. As the
>> result min_wal_size doesn't work as told. Specifically, setting
>> min/max_wal_size to 48MB and advance four or more segments then
>> two checkpoints leaves just one segment, which is less than
>> min_wal_size.
>> The attached patch fixes that. One arguable point on this would
>> be the removal of the behavior when RemoveXLogFile(name,
>> InvalidXLogRecPtr, ..).
>> The only place calling the function with the parameter is
>> timeline switching. Previously unconditionally 10 segments are
>> recycled after switchpoint but the reason for the behavior is we
>> didn't have the information on previous checkpoint at hand at the
>> time. But now we can use the timeline switch point as the
>> approximate of the last checkpoint's redo point and this allows
>> us to use min/max_wal_size properly at the time.

I have been looking at that, and tested with this simple scenario:
create table aa (a int);

Then just repeat the following SQLs:
insert into aa values (1);
select pg_switch_wal();
select pg_walfile_name(pg_current_wal_lsn());

By doing so, you would notice that the oldest WAL segment does not get
recycled after the checkpoint, while it should as the redo pointer is
always checkpoint generated. What happens is that this oldest segment
gets recycled every two checkpoints.

Then I had a look at the proposed patch, with a couple of comments.

- if (PriorRedoPtr == InvalidXLogRecPtr)
- recycleSegNo = endlogSegNo + 10;
- else
- recycleSegNo = XLOGfileslop(PriorRedoPtr);
+ recycleSegNo = XLOGfileslop(RedoRecPtr);
I think that this is a new behavior, and should not be changed, see
point 3 below.

In CreateCheckPoint(), the removal of past WAL segments is always going
to happen as RedoRecPtr is never InvalidXLogRecPtr, so the recycling
should happen also when PriorRedoPtr is InvalidXLogRecPtr, no?

/* Trim from the last checkpoint, not the last - 1 */
This comment could be adjusted, let's remove "not the last - 1" .

The calculation of _logSegNo in CreateRestartPoint is visibly incorrect,
this still uses PriorRedoPtr so the bug is not fixed for standbys. The
tweaks for ThisTimeLineID also need to be out of the loop where
PriorRedoPtr is InvalidXLogRecPtr, and only the distance calculation
should be kept.

Finally, in summary, this patch is doing actually three things:
1) Rename a couple of variables which refer incorrectly to the prior
checkpoint so as they refer to the last checkpoint generated.
2) Make sure that WAL recycling/removal happens based on the last
checkpoint generated, which is the one just generated when past WAL
segments are cleaned up as post-operation actions.
3) Enforce the recycling point to be the switch point instead of
arbitrarily recycle 10 segments, which is what b2a5545b has introduced.
Recycling at exactly the switch point is wrong, as the redo LSN of the
previous checkpoint may not be at the same segment when a timeline has
switched, so you would finish with recycling segments which are still
needed if an instance needs be crash-recovered post-promotion without
a completed post-recovery checkpoint. In short this is dangerous.
I'll let Heikki comment on that, but I think that you should fetch the
last redo LSN instead, still you need to be worried about checkpoints
running in parallel of the startup process, so I would stick with the
current logic.

1) and 2) are in my opinion clearly oversights from 4b0d28d, but 3) is

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2018-07-23 07:00:54 Re: Non-portable shell code in pg_upgrade tap tests
Previous Message Prabhat Sahu 2018-07-23 06:49:12 Memory leak with CALL to Procedure with COMMIT.