Re: Multiple full page writes in a single checkpoint?

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multiple full page writes in a single checkpoint?
Date: 2021-02-04 00:21:25
Message-ID: 20210204002125.GC11069@momjian.us
Lists: pgsql-hackers

On Wed, Feb 3, 2021 at 03:29:13PM -0800, Andres Freund wrote:
> > Is the above case valid, and would it cause two full page writes to WAL?
> > More specifically, wouldn't it cause every write of the page to the file
> > system to use a new LSN?
>
> No. 8) won't happen. Look e.g. at XLogSaveBufferForHint():
>
> /*
>  * Update RedoRecPtr so that we can make the right decision
>  */
> RedoRecPtr = GetRedoRecPtr();
>
> /*
>  * We assume page LSN is first data on *every* page that can be passed to
>  * XLogInsert, whether it has the standard page layout or not. Since we're
>  * only holding a share-lock on the page, we must take the buffer header
>  * lock when we look at the LSN.
>  */
> lsn = BufferGetLSNAtomic(buffer);
>
> if (lsn <= RedoRecPtr)
>     /* wal log hint bit */
>
> The RedoRecPtr is determined at 1. and doesn't change between 4) and
> 8). The LSN for 4) has to be *past* the RedoRecPtr from 1). Therefore we
> don't do another FPW.

OK, so what is happening is that it knows the page LSN is after the
start of the current checkpoint (the redo point), so it knows not to do
a full page write again? Smart, and makes sense.
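
To make that rule concrete, here is a minimal standalone sketch of the
comparison in the quoted code. It is not PostgreSQL source: the type
SketchLsn and the function sketch_hint_needs_fpw() are hypothetical names
used only for illustration, and the LSN is modeled as a plain 64-bit
integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL position; assumed to be a 64-bit value. */
typedef uint64_t SketchLsn;

/*
 * Mirrors the quoted test "lsn <= RedoRecPtr": a hint-bit-only change
 * needs a full page image only when the page was last WAL-logged at or
 * before the redo point of the current checkpoint.
 */
bool
sketch_hint_needs_fpw(SketchLsn page_lsn, SketchLsn redo_rec_ptr)
{
    return page_lsn <= redo_rec_ptr;
}
```

Once the first full page write of the checkpoint has moved the page LSN
past the redo point, the same test returns false for every later hint-bit
change until the next checkpoint starts.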

> Changing this is *completely* infeasible. In a lot of workloads it'd
> cause a *massive* explosion of WAL volume. Like quadratically. You'll
> need to find another way to generate a nonce.

Do we often do multiple writes to the file system of the same page
during a single checkpoint, particularly only-hint-bit-modified pages?
I didn't think so.

> In the non-hint bit case you'll automatically have a higher LSN in 7/8
> though. So you won't need to do anything about getting a higher nonce.

Yes, I was counting on that. :-)

> For the hint bit case in 8 you could consider just using any LSN generated
> after 4 (preferably already flushed to disk) - but that seems somewhat
> ugly from a debuggability POV :/. Alternatively you could just create a
> tiny WAL record to get a new LSN, but that'll sometimes trigger new WAL
> flushes when the pages are dirtied.

Yes, that would make sense. I do need the first full page write during
a checkpoint to be sure I don't have torn pages that have some part of
the page encrypted with one LSN and a second part with a different LSN.
You are right that I don't need a second full page write during the same
checkpoint because a torn page would just restore the first full page
write and throw away the second LSN and hint bit changes, which is fine.

I hadn't gotten around to asking about that until I found out whether my
previous assumptions were true, which they were not.

Is the logical approach here to modify XLogSaveBufferForHint() so that,
if a full page write is not needed, it creates a dummy WAL record that
just advances the WAL location and updates the page LSN? (Is there a
small WAL record I should reuse?) I can try to add a hint-bit-page-write
page counter, but that might overflow, and then we will need a way to
change the LSN anyway.
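
As a rough sketch of the idea in the previous paragraph, the toy model
below emits a full page image when the page LSN is at or before the redo
point, and otherwise a small dummy record whose only purpose is to yield a
fresh LSN. Everything here is an assumption for illustration: the SketchWal
struct, the function names, and the record sizes (8192 and 32 bytes) are
hypothetical and not taken from PostgreSQL.

```c
#include <stdint.h>

typedef uint64_t SketchLsn;

/* Toy WAL state; all fields are illustrative assumptions. */
typedef struct SketchWal
{
    SketchLsn insert_pos;    /* end of the last record written */
    SketchLsn redo_rec_ptr;  /* redo point of the current checkpoint */
    unsigned  fpw_count;     /* full page images emitted */
    unsigned  dummy_count;   /* dummy LSN-bump records emitted */
} SketchWal;

enum { SKETCH_FPW_SIZE = 8192, SKETCH_DUMMY_SIZE = 32 };

/* Append a record of the given size; its end position is the new LSN. */
SketchLsn
sketch_wal_append(SketchWal *wal, uint64_t size)
{
    wal->insert_pos += size;
    return wal->insert_pos;
}

/*
 * Log a hint-bit-only change and return the LSN to stamp on the page.
 * The first change after the redo point gets a full page image; later
 * ones get only a dummy record, so the page LSN still advances.
 */
SketchLsn
sketch_log_hint(SketchWal *wal, SketchLsn page_lsn)
{
    if (page_lsn <= wal->redo_rec_ptr)
    {
        wal->fpw_count++;
        return sketch_wal_append(wal, SKETCH_FPW_SIZE);
    }
    wal->dummy_count++;
    return sketch_wal_append(wal, SKETCH_DUMMY_SIZE);
}
```

In this model every hint-bit write gets a distinct, increasing LSN, at the
cost of one small record per write instead of a second full page image.
It does not model the extra WAL flushes Andres mentions.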

I am researching this so I can give a clear report on the impact of
adding this feature. I will update the wiki once we figure this out.

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee
