Re: Spreading full-page writes

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Spreading full-page writes
Date: 2014-05-26 14:34:18
Message-ID: CAHGQGwFYCaqPkjxsBkyF8vyMJxBrbpgBf_=Q9gq2i_uWFVq=uw@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 26, 2014 at 6:52 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> Here's an idea I tried to explain to Andres and Simon at the pub last night,
> on how to reduce the spikes in the amount of WAL written at the beginning of a
> checkpoint that full-page writes cause. I'm just writing this down for the
> sake of the archives; I'm not planning to work on this myself.
>
>
> When you are replaying a WAL record that lies between the Redo-pointer of a
> checkpoint and the checkpoint record itself, there are two possibilities:
>
> a) You started WAL replay at that checkpoint's Redo-pointer.
>
> b) You started WAL replay at some earlier checkpoint, and are already in a
> consistent state.
>
> In case b), you wouldn't need to replay any full-page images, normal
> differential WAL records would be enough. In case a), you do, and you won't
> be consistent until replaying all the WAL up to the checkpoint record.
>
> We can exploit those properties to spread out the spike. When you modify a
> page and you're about to write a WAL record, check if the page has the
> BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page
> against the *previous* checkpoint's redo-pointer, instead of the one that's
> currently in-progress. If no full-page image is required based on that
> comparison, IOW if the page was modified and a full-page image was already
> written after the earlier checkpoint, write a normal WAL record without
> full-page image and set a new flag in the buffer header (BM_NEEDS_FPW). Also
> set a new flag on the WAL record, XLR_FPW_SKIPPED.
>
> When the checkpointer (or any other backend that needs to evict a buffer) is
> about to flush a page from the buffer cache that has the BM_NEEDS_FPW flag
> set, write a new WAL record, containing a full-page-image of the page,
> before flushing the page.

How would this mechanism work during a base backup? Would pg_stop_backup
need to flush all buffers that have the BM_NEEDS_FPW flag set?
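
For reference, the write-side rule quoted above could be sketched roughly as
follows. This is only a toy model, not PostgreSQL code: the Buffer class, the
flag values, the record labels and the function names are all made up for
illustration. It also assumes that a page failing the comparison simply gets
a full-page image as usual, and that BM_NEEDS_FPW is cleared once the
separate image has been written; the proposal does not spell out either.

```python
from dataclasses import dataclass

BM_CHECKPOINT_NEEDED = 0x1  # illustrative value, not PostgreSQL's
BM_NEEDS_FPW = 0x2          # new flag proposed in the quoted text


@dataclass
class Buffer:
    page_lsn: int
    flags: int = 0


def log_change(buf, prev_redo, cur_redo, new_lsn, wal):
    """Append the WAL record for one page modification to the list `wal`."""
    in_checkpoint = bool(buf.flags & BM_CHECKPOINT_NEEDED)
    # Pages still waiting to be flushed by the in-progress checkpoint are
    # compared against the *previous* checkpoint's redo pointer.
    redo = prev_redo if in_checkpoint else cur_redo
    if buf.page_lsn <= redo:
        # Assumption: fall back to an ordinary full-page image.
        wal.append((new_lsn, "fpi"))
    elif in_checkpoint:
        buf.flags |= BM_NEEDS_FPW
        wal.append((new_lsn, "normal+XLR_FPW_SKIPPED"))
    else:
        wal.append((new_lsn, "normal"))
    buf.page_lsn = new_lsn


def flush(buf, new_lsn, wal):
    """Emit the deferred full-page image, if any, before writing the page."""
    if buf.flags & BM_NEEDS_FPW:
        wal.append((new_lsn, "separate-fpw"))
        buf.flags &= ~BM_NEEDS_FPW  # assumption: cleared after the image is logged
    buf.flags &= ~BM_CHECKPOINT_NEEDED
```

In this model a backup-end step would have to call something like flush() on
every buffer that still carries BM_NEEDS_FPW, which is the concern raised
above.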

>
> Here's how this works out during replay:
>
> a) You start WAL replay from the latest checkpoint's Redo-pointer.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't
> replay that record at all. It's OK because we know that there will be a
> separate record containing the full-page image of the page later in the
> stream.
>
> b) You are continuing WAL replay that started from an earlier checkpoint,
> and have already reached consistency.
>
> When you see a WAL record that's been marked with XLR_FPW_SKIPPED, replay it
> normally. It's OK, because the flag means that the page was modified after
> the earlier checkpoint already, and hence we must have seen a full-page
> image of it already. When you see one of the WAL records containing a
> separate full-page-image, ignore it.
>
> This scheme makes the b-case behave just as if the new checkpoint was never
> started. The regular WAL records in the stream are identical to what they
> would've been if the redo-pointer pointed to the earlier checkpoint. And the
> additional FPW records are simply ignored.
>
> In the a-case, it's not safe to replay the records marked with
> XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the usual
> torn-page hazards that come with that. However, the separate FPW records
> that come later in the stream will fix-up those pages.
>
>
> Now, I'm sure there are issues with this scheme I haven't thought about, but
> I wanted to get this written down. Note this does not reduce the overall WAL
> volume - on the contrary - but it ought to reduce the spike.

ISTM that this can increase WAL volume, because one data change can
generate both a normal WAL record and an FPW. No?
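
To make the two replay cases concrete, here is a minimal sketch of the
decision table as I read it from the quoted description. The record labels
("skipped", "separate_fpw", "normal") and the replay() function are
hypothetical names for illustration, not anything in the PostgreSQL source.

```python
def replay(records, started_at_latest_redo):
    """Return the action taken for each record kind, in stream order.

    started_at_latest_redo=True models case a), False models case b).
    """
    actions = []
    for kind in records:
        if kind == "skipped":
            # Normal record carrying XLR_FPW_SKIPPED.
            actions.append("skip" if started_at_latest_redo else "apply")
        elif kind == "separate_fpw":
            # Full-page image logged just before the page was flushed.
            actions.append("restore" if started_at_latest_redo else "ignore")
        else:
            actions.append("apply")
    return actions
```

In either case the stream carries two records for the one change to the
page (the flagged normal record and the later image), and only one of them
is ever applied.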

Regards,

--
Fujii Masao
