Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby
Date: 2016-02-03 14:28:18
Message-ID: CAB7nPqTTGgYckpbqv8Hip+fypD99tSg2WnObhAJZMDh=GUBoNg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On Tue, Feb 2, 2016 at 1:42 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Sat, Jan 30, 2016 at 7:38 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
>>
>> On Fri, Jan 29, 2016 at 9:13 PM, Michael Paquier wrote:
>> > Well, to put it short, I am just trying to find a way to make the
>> > backend skip unnecessary checkpoints on an idle system, which results
>> > in the following WAL pattern if system is completely idle:
>> > CHECKPOINT_ONLINE
>> > RUNNING_XACTS
>> > RUNNING_XACTS
>> > [etc..]
>> >
>> > The thing is that I am lost with the meaning of this condition to
>> > decide if a checkpoint should be skipped or not:
>> > if (prevPtr == ControlFile->checkPointCopy.redo &&
>> > prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
>> > {
>> > WALInsertLockRelease();
>> > LWLockRelease(CheckpointLock);
>> > return;
>> > }
>> > As at least one standby snapshot is logged before the checkpoint
>> > record, the redo position is never going to match the previous insert
>> > LSN, so checkpoints will never be skipped if wal_level >= hot_standby.
>> > Skipping such unnecessary checkpoints is what you would like to
>> > address, no? Because that's what I would like to do here first. And
>> > once we got that right, we could think about addressing the case where
>> > WAL segments are forcibly archived for idle systems.
>>
>> I have put a bit more of brain power into that, and finished with the
>> patch attached. A new field called chkpProgressLSN is attached to
>> XLogCtl, which is to the current insert position of the checkpoint
>> when wal_level <= archive, or the LSN position of the standby snapshot
>> taken after a checkpoint. The bgwriter code is modified as well so as
>> it uses this progress LSN and compares it with the current insert LSN
>> to determine if a standby snapshot should be logged or not. On an
>> instance of Postgres completely idle, no useless checkpoints, and no
>> useless standby snapshots are generated anymore.
>>
>
>
> @@ -8239,7 +8262,7 @@ CreateCheckPoint(int flags)
> if ((flags & (CHECKPOINT_IS_SHUTDOWN |
> CHECKPOINT_END_OF_RECOVERY |
> CHECKPOINT_FORCE)) == 0)
> {
> - if
> (prevPtr == ControlFile->checkPointCopy.redo &&
> + if (GetProgressRecPtr() == prevPtr &&
>
> prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
> {
>
> I think such a check won't consider all the WAL-activity happened
> during checkpoint (after checkpoint start, till it finishes) which was
> not the intention of this code without your patch.

Yes, that's true, in v2 this progress LSN can be updated in two ways:
- At the position where checkpoint has begun
- At the position where standby snapshot was taken after a checkpoint has run.
Honestly, even if we log the snapshot for example before
CheckpointGuts(), that's not going to make it... The main issue here
is that there will be a snapshot after this checkpoint, so the
existing condition is a rather broken concept already when wal_level
>= hot_standby because the redo pointer is never going to match the
previous LSN pointer. Another idea would be to ensure that the
snapshot is logged just after the redo pointer, this would be
reliable.

> I think both this and previous patch (hs-checkpoints-v1) approach
> won't fix the issue in all kind of scenario's.
>
> Let me try to explain what I think should fix this issue based on
> discussion above, suggestions by Andres and some of my own
> thoughts:
>
> Have a new variable in WALInsertLock that would indicate the last
> insertion position (last_insert_pos) we write to after holding that lock.
> Ensure that we don't update last_insert_pos while logging standby
> snapshot (simple way is to pass a flag in XLogInsert or distinguish
> it based on type of record (XLOG_RUNNING_XACTS) or if you can
> think of a more smarter way).

Simplest way is in XLogInsertRecord, the record data is available
there, there is for example some logic related to segment switches.

> Now both during checkpoint and
> in bgwriter, to find out whether there is any activity since last
> time, we need to check all the WALInsertLocks for latest insert position
> (by referring last_insert_pos) and compare it with the latest position
> we have noted during last such check and decide whether to proceed
> or not based on comparison result. If you think it is not adequate to
> store last_insert_pos in WALInsertLock, then we can think of having
> it in PGPROC.

So the progress check is used when deciding if a checkpoint should be
skipped or not, and when deciding if a standby snapshot should be
taken by the bgwriter? When bgwriter logs a snapshot, it will update
as well the last LSN found (which would be a single variable in for
example XLogCtlData, updated from the data taken from the WAL insert
slots). But there is a problem here: if there is no activity since the
last bgwriter snapshot, the next checkpoint would be skipped as well.
It seems to me that this is not acceptable, a checkpoint generation
would be decided based on if there has been some data activity since
the last snapshot taken, or to put it in other words, no checkpoints
would happen if there has been no activity within 15s after the last
standby snapshot has been logged by the bgwriter.

I have hacked something according to those lines, please see attached
(logs are just here for testing purposes). I may be missing something
obvious regarding the tracking of the last progress position. Code
speaks better than words here I think, so feel free to look at it.

> Yet another idea that occurs to me this morning is that why not
> have a variable in shared memory in XLogCtlInsert or XLogCtl
> similar to CurrBytePos/PrevBytePos which will be updated on
> each XLOGInsert apart from the XLOGInsert for standby snapshots
> and use that in a patch somewhat close to what you have in
> hs-checkpoints-v1.

I have considered that as well, but I do not think it is a good idea
to add more contention in ReserveXLogInsertLocation() while taking
insertpos_lck lock. That's already really hot, and we surely don't
want to make it hotter just to do checks on idle systems. If that's
what you had in mind. Btw, this is basically what I did in the
attached, no? Except that the current insert position is tracked
differently.

From v2 I sent upthread:
- if (prevPtr == ControlFile->checkPointCopy.redo &&
+ if (GetProgressRecPtr() == prevPtr &&
prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
This is wrong actually. prevPtr should be replaced by the redo LSN.
--
Michael

Attachment Content-Type Size
hs-checkpoints-v3.patch text/x-diff 8.6 KB

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message hein 2016-02-03 15:17:23 BUG #13909: String concat error with CITEXT after 9.5.0 upgrade.
Previous Message Michael Wimmer 2016-02-03 13:27:05 Java ClassCastException using JDBC driver and selecting smallint arrays

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2016-02-03 14:28:24 Re: "using previous checkpoint record at" maybe not the greatest idea?
Previous Message Robert Haas 2016-02-03 14:28:08 Re: extend pgbench expressions with functions