When I worked on the XLogInsert scaling patch, it became apparent that
some changes to the WAL format would make it a lot easier. So for 9.3,
I'd like to do some refactoring:
1. Use a 64-bit integer instead of the two-variable log/seg
representation, for identifying a WAL segment. This has no user-visible
effect, but makes the code a bit simpler.
2. Don't waste the last WAL segment in each logical 4GB file. Currently,
we skip the WAL segment ending with "FF". The comments claim that
wasting the last segment "ensures that we don't have problems
representing last-byte-position-plus-1", but in my experience, it just
makes things more complicated. You have two ways to represent the
segment boundary, and some functions are picky about which one is used. For
example, XLogWrite() assumes that when you want to flush to the end of a
logical log file, you use the "5/FF000000" representation, not
"6/00000000". Other functions, like XLogPageRead(), expect the latter.
This is a backwards-incompatible change for external utilities that know
how the WAL segment numbering works. Hopefully there aren't too many of
them.
3. Move the only field, xl_rem_len, from the continuation record header
straight to the xlog page header, eliminating XLogContRecord altogether.
This makes it easier to calculate in advance how much space a WAL record
requires, as it no longer depends on how many pages it has to be split
across. This wastes 4-8 bytes on every xlog page, but that's not much.
4. Allow the WAL record header to be split across page boundaries.
Currently, if there are fewer than SizeOfXLogRecord bytes left on the
current WAL page, they are wasted, and the next record is inserted at the
beginning of the next page. The problem with that is again that it makes
it impossible to know in advance exactly how much space a WAL record
requires, because it depends on how many bytes need to be wasted at the
end of the current page.
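
As a rough sketch of points 1 and 2 (not code from any patch): once the
"FF" segment is no longer skipped, a 64-bit segment number can be derived
from either the old log/seg pair or a 64-bit byte position, and the
boundary between logical files 5 and 6 has only one representation. The
function and macro names below are hypothetical, and the 16MB segment
size is assumed to be the default.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default WAL segment size: 16MB. */
#define SEG_SIZE        (UINT64_C(16) * 1024 * 1024)
/* Segments per 4GB logical file when the last one is NOT skipped: 256. */
#define SEGS_PER_XLOGID (UINT64_C(0x100000000) / SEG_SIZE)

/* Hypothetical helper: fold the two-variable log/seg pair into one number. */
static uint64_t
segno_from_log_seg(uint32_t log, uint32_t seg)
{
    assert(seg < SEGS_PER_XLOGID);
    return (uint64_t) log * SEGS_PER_XLOGID + seg;
}

/* Hypothetical helper: segment number containing a 64-bit byte position. */
static uint64_t
segno_from_bytepos(uint64_t bytepos)
{
    return bytepos / SEG_SIZE;
}
```

With these definitions, "5/FF000000" is simply the start of segment 1535,
and "6/00000000" is the start of segment 1536; neither needs special
treatment.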
These changes will help the XLogInsert scaling patch, by making the
space calculations simpler. In essence, to reserve space for a WAL
record of size X, you just need to do "bytepos += X". There are a lot
more details to that, like mapping from the contiguous byte position
to an XLogRecPtr that takes page headers into account, and noticing
RedoRecPtr changes safely, but it's a start.
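
To illustrate the "bytepos += X" idea, here is a minimal sketch of
reserving space in a contiguous byte position and mapping it to a
physical position that skips page headers. It assumes every page carries
a header of one fixed size (the 24 bytes used here are a made-up value),
ignores any larger header on the first page of a segment, and leaves out
locking entirely. All names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE  UINT64_C(8192)  /* assumed WAL page size */
#define PAGE_HDR   UINT64_C(24)    /* hypothetical fixed page header size */
#define USABLE     (PAGE_SIZE - PAGE_HDR)  /* payload bytes per page */

/*
 * Reserve 'size' bytes: return the starting byte position and advance
 * the shared counter. No locking; a real implementation would need it.
 */
static uint64_t
reserve_space(uint64_t *bytepos, uint64_t size)
{
    uint64_t start = *bytepos;

    *bytepos += size;
    return start;
}

/*
 * Map a contiguous payload byte position to a physical position that
 * accounts for the header at the start of every page.
 */
static uint64_t
bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t fullpages = bytepos / USABLE;
    uint64_t offset = bytepos % USABLE;

    return fullpages * PAGE_SIZE + PAGE_HDR + offset;
}
```

This only works as plain arithmetic because the space a record needs no
longer depends on where it happens to land on a page, which is what
points 3 and 4 are meant to achieve.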