Re: XLog changes for 9.3

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: XLog changes for 9.3
Date: 2012-06-07 15:35:11
Message-ID: 4FD0CA2F.50601@enterprisedb.com
Lists: pgsql-hackers

On 07.06.2012 17:18, Andres Freund wrote:
> On Thursday, June 07, 2012 03:50:35 PM Heikki Linnakangas wrote:
>> 3. Move the only field, xl_rem_len, from the continuation record header
>> straight to the xlog page header, eliminating XLogContRecord altogether.
>> This makes it easier to calculate in advance how much space a WAL record
>> requires, as it no longer depends on how many pages it has to be split
>> across. This wastes 4-8 bytes on every xlog page, but that's not much.
> +1. I don't think this will waste a measurable amount in real-world
> scenarios. A very big percentage of pages have continuation records.

Yeah, although the way I'm planning to do it, you'll waste 4 bytes (on
64-bit architectures) even when there is a continuation record, because
of alignment:

typedef struct XLogPageHeaderData
{
    uint16      xlp_magic;      /* magic value for correctness checks */
    uint16      xlp_info;       /* flag bits, see below */
    TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
    XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */

+   uint32      xlp_rem_len;    /* bytes remaining of continued record */
} XLogPageHeaderData;

The page header is currently 16 bytes in length, so adding a 4-byte
field to it bumps the aligned size to 24 bytes. Nevertheless, I think we
can well live with that.
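
To make the arithmetic concrete, here is a minimal sketch of the size calculation. The typedefs are stand-ins I wrote for illustration (assuming XLogRecPtr is a pair of 32-bit fields), not the real PostgreSQL headers, and align_up() is a hypothetical helper that does what MAXALIGN does for a given alignment:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in typedefs (assumed layouts, not the real PostgreSQL headers). */
typedef uint32_t TimeLineID;
typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

/* Page header as it is today: 2 + 2 + 4 + 8 = 16 bytes of fields. */
typedef struct
{
    uint16_t    xlp_magic;
    uint16_t    xlp_info;
    TimeLineID  xlp_tli;
    XLogRecPtr  xlp_pageaddr;
} OldPageHeader;

/* Page header with the proposed field: 20 bytes of fields. */
typedef struct
{
    uint16_t    xlp_magic;
    uint16_t    xlp_info;
    TimeLineID  xlp_tli;
    XLogRecPtr  xlp_pageaddr;
    uint32_t    xlp_rem_len;
} NewPageHeader;

/* Hypothetical helper: round len up to a multiple of align (a power of two). */
static size_t
align_up(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (len + align - 1) & ~(align - 1);
}
```

With 8-byte maximum alignment, align_up(sizeof(OldPageHeader), 8) stays at 16 while align_up(sizeof(NewPageHeader), 8) rounds 20 up to 24. With 4-byte maximum alignment the new header stays at 20, which is why the extra padding only shows up on 64-bit architectures.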

>> 4. Allow WAL record header to be split across page boundaries.
>> Currently, if there are less than SizeOfXLogRecord bytes left on the
>> current WAL page, it is wasted, and the next record is inserted at the
>> beginning of the next page. The problem with that is again that it makes
>> it impossible to know in advance exactly how much space a WAL record
>> requires, because it depends on how many bytes need to be wasted at the
>> end of current page.
> +0.5. It's somewhat convenient to be able to look at a record before you have
> reassembled it over multiple pages. But it's probably not worth the
> implementation complexity.

Looking at the code, I think it'll be about the same complexity for
XLogInsert in its current form (and it will help the patch I'm working
on), and it makes ReadRecord() a bit more complicated, but not much.

> If we do that we can remove all the alignment padding as well. Which would be a
> problem for you anyway, wouldn't it?

It's not a problem. You just MAXALIGN the size of the record when you
calculate how much space it needs, and then all records become naturally
MAXALIGNed. We could quite easily remove the alignment on-disk if we
wanted to, since ReadRecord() already always copies the record to an
aligned buffer, but I wasn't planning to do that.
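
As a rough illustration of that point, the sketch below rounds each record length up before advancing the insert position, so every record starts on an aligned boundary. The names maxalign() and reserve_record_space() are made up for this example, the 8-byte alignment is an assumption (64-bit platform), and it ignores locking entirely:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: maximum alignment is 8 bytes, as on typical 64-bit platforms. */
#define ASSUMED_MAXALIGN 8u

/* Hypothetical stand-in for MAXALIGN(): round len up to the next multiple. */
static uint64_t
maxalign(uint64_t len)
{
    return (len + (ASSUMED_MAXALIGN - 1)) & ~(uint64_t) (ASSUMED_MAXALIGN - 1);
}

/*
 * Reserve space for a record of rec_len bytes.  Returns the byte position
 * where the record starts and advances *bytepos past the padded record.
 * No locking here; a real implementation would have to serialize this.
 */
static uint64_t
reserve_record_space(uint64_t *bytepos, uint32_t rec_len)
{
    uint64_t    start = *bytepos;

    /* Holds as long as every earlier reservation went through maxalign(). */
    assert(start % ASSUMED_MAXALIGN == 0);

    *bytepos += maxalign(rec_len);
    return start;
}
```

Starting from position 0, reserving 30 bytes and then 8 bytes puts the records at positions 0 and 32 and leaves the insert position at 40.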

>> These changes will help the XLogInsert scaling patch, by making the
>> space calculations simpler. In essence, to reserve space for a WAL
>> record of size X, you just need to do "bytepos += X". There's a lot
>> more details with that, like mapping from the contiguous byte position
>> to an XLogRecPtr that takes page headers into account, and noticing
>> RedoRecPtr changes safely, but it's a start.
> Hm. Wouldn't you need to remove short/long page headers for that as well?

No, those are ok because they're predictable. Although it would make the
mapping simpler. To convert from a contiguous xlog byte position that
excludes all headers, to XLogRecPtr, you need to do something like this
(I just made this up, probably has bugs, but it's about this complex):

#define UsableBytesInPage (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

uint64 xlogrecptr;
uint64 full_segments = bytepos / UsableBytesInSegment;
int offset_in_segment = bytepos % UsableBytesInSegment;
XLogRecPtr recptr;

xlogrecptr = full_segments * XLOG_SEG_SIZE;
/* is it on the first page? */
if (offset_in_segment < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    xlogrecptr += SizeOfXLogLongPHD + offset_in_segment;
else
{
    /* first page is fully used */
    xlogrecptr += XLOG_BLCKSZ;
    /* add other full pages */
    offset_in_segment -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
    xlogrecptr += (offset_in_segment / UsableBytesInPage) * XLOG_BLCKSZ;
    /* and finally the short header plus the offset within the last page */
    xlogrecptr += SizeOfXLogShortPHD + offset_in_segment % UsableBytesInPage;
}
/* finally convert the 64-bit xlogrecptr to a XLogRecPtr struct */
recptr.xlogid = xlogrecptr >> 32;
recptr.xrecoff = xlogrecptr & 0xffffffff;

Encapsulated in a function, that's not too bad. But if we want to make
that simpler, one idea would be to allocate the whole 1st page in each
WAL segment for metadata. That way all the actual xlog pages would hold
the same amount of xlog data.
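
For comparison, here is a self-contained sketch of both mappings as functions. The constants are assumptions picked for illustration (8 kB pages, 16 MB segments, 24-byte short and 40-byte long page headers), and the function names are made up. The first function follows the mapping described above, adding the short page header whenever the position lands beyond the first page. The second is my reading of the "whole first page for metadata" idea, assuming the remaining pages keep their short headers:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; the real ones come from the build. */
#define XLOG_BLCKSZ         8192u
#define XLOG_SEG_SIZE       (16u * 1024u * 1024u)
#define SizeOfXLogShortPHD  24u
#define SizeOfXLogLongPHD   40u

static_assert(XLOG_SEG_SIZE % XLOG_BLCKSZ == 0, "segment must hold whole pages");

#define UsableBytesInPage   (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

/* Map a header-free byte position to a flat 64-bit WAL address. */
static uint64_t
bytepos_to_recptr(uint64_t bytepos)
{
    uint64_t    full_segments = bytepos / UsableBytesInSegment;
    uint32_t    offset = (uint32_t) (bytepos % UsableBytesInSegment);
    uint64_t    recptr = full_segments * XLOG_SEG_SIZE;

    if (offset < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* lands on the first page, which carries the long header */
        recptr += SizeOfXLogLongPHD + offset;
    }
    else
    {
        /* skip the first page, then whole pages, then the short header */
        offset -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        recptr += XLOG_BLCKSZ;
        recptr += (uint64_t) (offset / UsableBytesInPage) * XLOG_BLCKSZ;
        recptr += SizeOfXLogShortPHD + offset % UsableBytesInPage;
    }
    return recptr;
}

/* Variant: first page of every segment holds only metadata (assumption). */
#define UsableBytesInSegmentAlt \
    ((XLOG_SEG_SIZE / XLOG_BLCKSZ - 1) * UsableBytesInPage)

static uint64_t
bytepos_to_recptr_alt(uint64_t bytepos)
{
    uint64_t    full_segments = bytepos / UsableBytesInSegmentAlt;
    uint32_t    offset = (uint32_t) (bytepos % UsableBytesInSegmentAlt);

    /* every data page looks the same, so no first-page special case */
    return full_segments * XLOG_SEG_SIZE
        + XLOG_BLCKSZ
        + (uint64_t) (offset / UsableBytesInPage) * XLOG_BLCKSZ
        + SizeOfXLogShortPHD + offset % UsableBytesInPage;
}
```

The 64-bit result can then be split into the two 32-bit halves of an XLogRecPtr with `>> 32` and `& 0xffffffff`, as in the snippet above. The variant trades one page per segment for a mapping without the branch.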

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
