tackling full page writes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: tackling full page writes
Date: 2011-05-24 20:34:29
Message-ID: BANLkTimhopkDvD2y_S-0Kf874ueX-gQD8Q@mail.gmail.com
Lists: pgsql-hackers

While eating good Indian food and talking about aviation accidents on
the last night of PGCon, Greg Stark, Heikki Linnakangas, and I found
some time to brainstorm about possible ways to reduce the impact of
full_page_writes. I'm not sure that these ideas are much good, but
for the sake of posterity:

1. Heikki suggested that instead of doing full page writes, we might
try to write only the parts of the page that have changed. For
example, if we had 16 bits to play with in the page header (which we
don't), then we could imagine the page as being broken up into 16
512-byte chunks, one per bit. Each time we update the page, we write
whatever subset of the 512-byte chunks we're actually modifying,
except for any that have been written since the last checkpoint. In
more detail, when writing a WAL record, if a checkpoint has intervened
since the page LSN, then we first clear all 16 bits, set the bits
for the chunks we're modifying, and XLOG those chunks. If no
checkpoint has intervened, then we set the bits for any chunks that we
are modifying and for which the corresponding bits aren't yet set, and
XLOG just those chunks. As I think about it a bit more, we'd
need to XLOG not only the parts of the page we're actually modifying,
but also any parts that the WAL record needs to be correct on replay.
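
To make the bookkeeping in idea 1 concrete, here is a minimal Python
sketch of the per-page chunk bitmap. It is only an illustration of the
rule described above, not PostgreSQL code: the 16-bit header field does
not exist, and the names chunks_touched and chunks_to_log are
hypothetical.

```python
# Sketch only: assumes an 8192-byte page split into 16 chunks of 512 bytes,
# tracked by a hypothetical 16-bit bitmap in the page header.
CHUNK_SIZE = 512
PAGE_SIZE = 8192
NUM_CHUNKS = PAGE_SIZE // CHUNK_SIZE  # 16


def chunks_touched(offset, length):
    """Return the chunk indexes overlapped by bytes [offset, offset + length)."""
    if length <= 0:
        return set()
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return set(range(first, last + 1))


def chunks_to_log(bitmap, checkpoint_intervened, touched):
    """Return (new_bitmap, chunks_to_xlog) for one WAL record.

    bitmap: hypothetical 16-bit field, bit i set means chunk i has already
            been logged since the last checkpoint.
    checkpoint_intervened: True if a checkpoint happened since the page LSN.
    touched: chunk indexes this record modifies (or depends on for replay).
    """
    if checkpoint_intervened:
        bitmap = 0  # nothing has been logged in the new checkpoint cycle
    to_log = [c for c in sorted(touched) if not bitmap & (1 << c)]
    for c in to_log:
        bitmap |= 1 << c
    return bitmap, to_log
```

For example, a first change after a checkpoint that spans bytes 500-599
logs chunks 0 and 1; a later change at bytes 1000-1049 touches chunks 1
and 2 but only needs to log chunk 2.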

(It was further suggested that, in our grand tradition of bad naming,
we could name this feature "partial full page writes" and enable it
either with a setting of full_page_writes=partial, or better yet, add
a new GUC partial_full_page_writes. The beauty of the latter is that
it's completely ambiguous what happens when full_page_writes=off and
partial_full_page_writes=on. Actually, we could invert the sense and
call it disable_partial_full_page_writes instead, which would probably
remove all hope of understanding. This all seemed completely
hilarious when we were talking about it, and we weren't even drunk.)

2. The other fairly obvious alternative is to adjust our existing WAL
record types to be idempotent - i.e. to not rely on the existing page
contents. For XLOG_HEAP_INSERT, we currently store the target tid and
the tuple contents. I'm not sure if there's anything else, but we
would obviously need the offset where the new tuple should be written,
which we currently infer from reading the existing page contents. For
XLOG_HEAP_DELETE, we store just the TID of the target tuple; we would
certainly need to store its offset within the block, and maybe the
infomask. For XLOG_HEAP_UPDATE, we'd need the old and new offsets and
perhaps also the old and new infomasks. Assuming that's all we need
and I'm not missing anything (which I won't bet on), that means we'd
be adding, say, 4 bytes per insert or delete and 8 bytes per update.
So, if checkpoints are spread out widely enough that there will be
more than ~2K operations per page between checkpoints (an 8K page
divided by 4 extra bytes per operation), then it makes more sense to
just do a full page write and call it good. If not, this idea might
have legs.
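
The break-even estimate can be checked with a few lines of Python. This
is a back-of-the-envelope sketch using the guessed per-record overheads
above (4 bytes per insert or delete, 8 per update) and an 8192-byte
page; it ignores record headers and any savings from omitting free
space in a full page image. The function names are hypothetical.

```python
# Sketch only: compares the extra WAL bytes of idempotent records against
# the cost of one full page image per page per checkpoint cycle.
PAGE_SIZE = 8192
EXTRA_BYTES = {"insert": 4, "delete": 4, "update": 8}  # guesses from the text


def extra_wal_bytes(ops):
    """Total extra bytes for a dict like {"insert": 100, "update": 20}."""
    return sum(EXTRA_BYTES[kind] * count for kind, count in ops.items())


def idempotent_records_win(ops):
    """True if the extra bytes stay below the size of one full page image."""
    return extra_wal_bytes(ops) < PAGE_SIZE
```

With these numbers, 2048 inserts on one page between checkpoints cost
exactly one page's worth of extra WAL, which is where the ~2K figure
comes from.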

3. Going a bit further, Greg proposed the idea of ripping out our
current WAL infrastructure altogether and instead just having one WAL
record that says "these byte ranges on this page changed to have these
new contents". That's elegantly simple, but I'm afraid it would bloat
the records quite a bit. For example, as Heikki pointed out,
XLOG_HEAP_DELETE relies on the XID in the record header to figure out
what to write, and all the heap-modification operations implicitly
specify the visibility map change when they specify the heap change.
We currently have a flag to indicate whether the visibility map
actually requires an update, but it's just one bit. However, one
possible application of this concept is that we could add something
like this in along with our existing WAL record types. It might be
useful, for example, for third-party index AMs, which are currently
pretty much out of luck.
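
As a rough illustration of the "these byte ranges changed" record in
idea 3, the following Python sketch derives such ranges by comparing
before and after images of a page and then replays them. The record
layout (a list of offset/bytes pairs) and the names diff_ranges and
apply_ranges are assumptions made for this example, not an existing or
proposed WAL format.

```python
# Sketch only: a generic "byte ranges changed" record, modeled as a list
# of (offset, new_bytes) pairs for a single page.
def diff_ranges(old, new):
    """Return (offset, new_bytes) for each maximal run of differing bytes."""
    if len(old) != len(new):
        raise ValueError("page images must have the same length")
    ranges = []
    start = None
    for i in range(len(old)):
        if old[i] != new[i]:
            if start is None:
                start = i  # a run of changed bytes begins here
        elif start is not None:
            ranges.append((start, bytes(new[start:i])))
            start = None
    if start is not None:
        ranges.append((start, bytes(new[start:])))
    return ranges


def apply_ranges(page, ranges):
    """Replay a record: overwrite each listed range with its new contents."""
    out = bytearray(page)
    for offset, data in ranges:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

Because each range carries absolute contents, replaying the same record
twice yields the same page. The size concern above also shows up here:
anything a current record type derives at replay time would have to be
spelled out as bytes.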

That's about as far as we got. Though I haven't convinced anyone else
yet, I still think there's some merit to the idea of just writing the
portion of the page that precedes pd_upper. WAL records would have to
assume that the tuple data might be clobbered, but they could rely on
the early portion of the page to be correct. AFAICT, that would be OK
for all of the existing WAL records except for XLOG_HEAP2_CLEAN (i.e.
vacuum), with the exception that - prior to the minimum recovery point
- they'd need to apply their changes unconditionally rather than
considering the page LSN. Tom has argued that won't work, but I'm not
sure he's convinced anyone else yet...

Anyone else have good ideas?

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

