Spreading full-page writes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Spreading full-page writes
Date: 2014-05-25 21:52:20
Message-ID: 53826614.7040809@vmware.com
Lists: pgsql-hackers

Here's an idea I tried to explain to Andres and Simon at the pub last
night, on how to reduce the spikes in the amount of WAL written at the
beginning of a checkpoint that full-page writes cause. I'm just writing
this down for the sake of the archives; I'm not planning to work on this
myself.

When you are replaying a WAL record that lies between the Redo-pointer
of a checkpoint and the checkpoint record itself, there are two
possibilities:

a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in
a consistent state.

In case b), you wouldn't need to replay any full-page images; normal
differential WAL records would be enough. In case a), you do, and you
won't be consistent until you have replayed all the WAL up to the
checkpoint record.

We can exploit those properties to spread out the spike. When you modify
a page and you're about to write a WAL record, check if the page has the
BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page
against the *previous* checkpoint's redo-pointer, instead of the
redo-pointer of the checkpoint that's currently in progress. If no
full-page image is required based on
that comparison, IOW if the page was modified and a full-page image was
already written after the earlier checkpoint, write a normal WAL record
without full-page image and set a new flag in the buffer header
(BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.
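
To make the insert-side rule concrete, here is a minimal sketch in C. It
is not PostgreSQL code: BufferSketch, RecordSketch, needs_fpi() and
decide_record() are hypothetical names, the flag bit values are made up,
and it assumes the usual rule that a page needs a full-page image when
its LSN is at or before the redo-pointer being compared against. It
follows the description above literally and ignores locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;                /* stand-in for an LSN */

/* Flag names are from the proposal; the bit values are made up. */
#define BM_CHECKPOINT_NEEDED 0x01u
#define BM_NEEDS_FPW         0x02u
#define XLR_FPW_SKIPPED      0x01u

typedef struct {
    Lsn      page_lsn;               /* LSN of the last change to the page */
    unsigned flags;                  /* BM_* bits */
} BufferSketch;

typedef struct {
    bool     has_fpi;                /* record carries a full-page image */
    unsigned flags;                  /* XLR_* bits */
} RecordSketch;

/* Assumed rule: an FPI is needed if the page is unmodified since redo. */
static bool
needs_fpi(Lsn page_lsn, Lsn redo)
{
    return page_lsn <= redo;
}

/* Decide what kind of WAL record to write for a change to buf. */
static RecordSketch
decide_record(BufferSketch *buf, Lsn prev_redo, Lsn cur_redo)
{
    RecordSketch rec = { false, 0 };

    assert(prev_redo <= cur_redo);

    if ((buf->flags & BM_CHECKPOINT_NEEDED) &&
        !needs_fpi(buf->page_lsn, prev_redo))
    {
        /* An FPI was already logged after the earlier checkpoint:
         * write a plain record now and defer the image to flush time. */
        buf->flags |= BM_NEEDS_FPW;
        rec.flags |= XLR_FPW_SKIPPED;
    }
    else
    {
        /* Ordinary behaviour: compare against the current redo-pointer. */
        rec.has_fpi = needs_fpi(buf->page_lsn, cur_redo);
    }

    return rec;
}
```

With prev_redo = 100 and cur_redo = 200, a flagged buffer whose page LSN
is 150 gets a plain record marked XLR_FPW_SKIPPED, while one whose page
LSN is 50 still gets an ordinary full-page image.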

When the checkpointer (or any other backend that needs to evict a buffer) is
about to flush a page from the buffer cache that has the BM_NEEDS_FPW
flag set, write a new WAL record, containing a full-page-image of the
page, before flushing the page.
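
The flush-side step could look roughly like the sketch below. Again this
is only an illustration: BufferSketch, WalSketch, DiskSketch and
flush_buffer() are hypothetical, the WAL and the disk are modelled as
in-memory structs, and clearing BM_NEEDS_FPW once the image has been
logged is my assumption rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192           /* illustrative page size */
#define BM_NEEDS_FPW  0x02u          /* name from the proposal, value made up */

typedef struct {
    unsigned      flags;
    bool          dirty;
    unsigned char page[SKETCH_BLCKSZ];
} BufferSketch;

typedef struct {
    int           fpi_records;       /* standalone FPI records written */
    unsigned char last_fpi[SKETCH_BLCKSZ];
} WalSketch;

typedef struct {
    int           writes;            /* number of page writes */
    unsigned char page[SKETCH_BLCKSZ];
} DiskSketch;

/* Flush one buffer, logging a deferred full-page image first if needed. */
static void
flush_buffer(BufferSketch *buf, WalSketch *wal, DiskSketch *disk)
{
    assert(buf != NULL && wal != NULL && disk != NULL);

    if (buf->flags & BM_NEEDS_FPW)
    {
        /* The image has to reach the WAL before the page reaches disk. */
        memcpy(wal->last_fpi, buf->page, SKETCH_BLCKSZ);
        wal->fpi_records++;
        buf->flags &= ~BM_NEEDS_FPW; /* assumption: flag is cleared here */
    }

    memcpy(disk->page, buf->page, SKETCH_BLCKSZ);
    disk->writes++;
    buf->dirty = false;
}
```

In a real implementation the standalone record would also have to be
flushed to disk before the data page is written, as with any other WAL
record that covers the page.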

Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't
replay that record at all. It's OK because we know that there will be a
separate record containing the full-page image of the page later in the
stream.

b) You are continuing WAL replay that started from an earlier
checkpoint, and have already reached consistency.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED,
replay it normally. It's OK, because the flag means that the page was
modified after the earlier checkpoint already, and hence we must have
seen a full-page image of it already. When you see one of the WAL
records containing a separate full-page-image, ignore it.
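
The two replay cases boil down to a small decision table, sketched here
in C. RecKind, ReplayAction and replay_action() are hypothetical names,
and started_at_this_redo stands for "replay began at this checkpoint's
Redo-pointer", i.e. case a). The table only applies to records that lie
between that Redo-pointer and the checkpoint record.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    REC_NORMAL,          /* ordinary record, with or without an inline FPI */
    REC_FPW_SKIPPED,     /* record marked XLR_FPW_SKIPPED */
    REC_SEPARATE_FPI     /* standalone full-page image written at flush */
} RecKind;

typedef enum { ACT_APPLY, ACT_IGNORE } ReplayAction;

static ReplayAction
replay_action(RecKind kind, bool started_at_this_redo)
{
    switch (kind)
    {
        case REC_FPW_SKIPPED:
            /* a) the page may be torn, wait for the separate image;
             * b) the page is already consistent, apply the delta. */
            return started_at_this_redo ? ACT_IGNORE : ACT_APPLY;

        case REC_SEPARATE_FPI:
            /* a) this image repairs the page; b) it is redundant. */
            return started_at_this_redo ? ACT_APPLY : ACT_IGNORE;

        case REC_NORMAL:
            break;
    }

    assert(kind == REC_NORMAL);
    return ACT_APPLY;
}
```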

This scheme makes the b-case behave just as if the new checkpoint was
never started. The regular WAL records in the stream are identical to
what they would've been if the redo-pointer pointed to the earlier
checkpoint. And the additional FPW records are simply ignored.

In the a-case, it's not safe to replay the records marked with
XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the
usual torn-page hazards that come with that. However, the separate FPW
records that come later in the stream will fix up those pages.
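
As a toy illustration of why both cases end up in the same place, the
sketch below models a page as a single integer, differential records as
increments, and a full-page image as an assignment. Rec, replay() and
example_stream are made up for this example, and the model ignores
everything else about real WAL replay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REC_NORMAL, REC_FPW_SKIPPED, REC_SEPARATE_FPI } RecKind;

typedef struct {
    RecKind kind;
    int     value;       /* delta to add, or the whole page for an FPI */
} Rec;

/* Replay recs onto page; started_at_this_redo selects case a) or b). */
static int
replay(int page, const Rec *recs, size_t n, bool started_at_this_redo)
{
    for (size_t i = 0; i < n; i++)
    {
        switch (recs[i].kind)
        {
            case REC_NORMAL:
                page += recs[i].value;
                break;
            case REC_FPW_SKIPPED:
                if (!started_at_this_redo)   /* case b): apply the delta */
                    page += recs[i].value;
                break;
            case REC_SEPARATE_FPI:
                if (started_at_this_redo)    /* case a): restore the page */
                    page = recs[i].value;
                break;
        }
    }
    return page;
}

/* The page held 10 at the redo-pointer; two deferred changes, then the
 * image taken at flush time (10 + 1 + 2 = 13), then a later change. */
static const Rec example_stream[] = {
    { REC_FPW_SKIPPED,   1 },
    { REC_FPW_SKIPPED,   2 },
    { REC_SEPARATE_FPI, 13 },
    { REC_NORMAL,        5 },
};
```

Replaying example_stream in case b) from the intact value 10, or in case
a) from any torn value, gives 18 both times.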

Now, I'm sure there are issues with this scheme I haven't thought about,
but I wanted to get this written down. Note this does not reduce the
overall WAL volume - on the contrary - but it ought to reduce the spike.

- Heikki
