Re: WAL format and API changes (9.5)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-11-06 15:32:33
Message-ID: 545B9491.6040409@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Replying to some of your comments below. The rest were trivial issues
that I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:
> * Is it really a good idea to separate XLogRegisterBufData() from
> XLogRegisterBuffer() the way you've done it ? If we ever actually get
> a record with a large numbers of blocks touched this issentially is
> O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find
the right registered_buffer struct, might be expensive? If that ever
becomes a problem, a simple fix would be to start the linear search from
the end, and make sure that when you touch a large number of blocks, you
do all the XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.

I've also though about having XLogRegisterBuffer() return the pointer to
the struct, and passing it as argument to XLogRegisterBufData. So the
pattern in WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have
potential to turn XLogRegisterBufData into a macro or inline function,
to eliminate the function call overhead. I played with that a little
bit, but the difference in performance was so small that it didn't seem
worth it. But passing the registered_buffer pointer, like above, might
not be a bad thing anyway.

> * There's lots of functions like XLogRecHasBlockRef() that dig through
> the wal record. A common pattern is something like:
> if (XLogRecHasBlockRef(record, 1))
> XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
> else
> oldblk = newblk;
>
> I think doing that repeatedly is quite a bad idea. We should parse the
> record once and then use it in a sensible format. Not do it in pieces,
> over and over again. It's not like we ignore backup blocks - so doing
> this lazily doesn't make sense to me.
> Especially as ValidXLogRecord() *already* has parsed the whole damn
> thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure. Vast majority of WAL records contain
one, maybe two, block references, so it's not that expensive to find the
right one, even if you do it several times.

I ran a quick performance test on WAL replay performance yesterday. I
ran pgbench for 1000000 transactions with WAL archiving enabled, and
measured the time it took to replay the generated WAL. I did that with
and without the patch, and I didn't see any big difference in replay
times. I also ran "perf" on the startup process, and the profiles looked
identical. I'll do more comprehensive testing later, with different
index types, but I'm convinced that there is no performance issue in
replay that we'd need to worry about.

If it mattered, a simple optimization to the above pattern would be to
have XLogRecGetBlockTag return true/false, indicating if the block
reference existed at all. So you'd do:

if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
oldblk != newblk;

On the other hand, decomposing the WAL record into parts, and passing
the decomposed representation to the redo routines would allow us to
pack the WAL record format more tightly, as accessing the different
parts of the on-disk format wouldn't then need to be particularly fast.
For example, I've been thinking that it would be nice to get rid of the
alignment padding in XLogRecord, and between the per-buffer data
portions. We could copy the data to aligned addresses as part of the
decomposition or parsing of the WAL record, so that the redo routines
could still assume aligned access.

> * I wonder if it wouldn't be worthwile, for the benefit of the FPI
> compression patch, to keep the bkp block data after *all* the
> "headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block
headers to the beginning of the WAL record, just after the XLogInsert
struct. Somehow it feels weird to have a bunch of header structs
sandwiched between the rmgr-data and per-buffer data. Also, 4-byte
alignment is enough for the XLogRecordBlockData struct, so that would be
an easy way to make use of the space currently wasted for alignment
padding in XLogRecord.

> * I think heap_xlog_update is buggy for wal_level=logical because it
> computes the length of the tuple using
> tuplen = recdataend - recdata;
> But the old primary key/old tuple value might be stored there as
> well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion.
That tuplen computation is done on backup block 0's data.

> * It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
> to get rid of:
> + /*
> + * The new tuple is normally stored as buffer 0's data. But if
> + * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
> + * data, after the xl_heap_insert struct.
> + */
> + if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
> + {
> + data = XLogRecGetData(record) + SizeOfHeapInsert;
> + datalen = record->xl_len - SizeOfHeapInsert;
> + }
> + else
> + data = XLogRecGetBlockData(record, 0, &datalen);
> or have I misunderstood how that works?

Ah, you're right. Actually, the code that writes the WAL record *does*
use REGBUF_KEEP_DATA. That was a bug in the redo routine, it should
always look into buffer 0's data.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-11-06 15:49:37 Re: group locking: incomplete patch, just for discussion
Previous Message Bernd Helmle 2014-11-06 15:26:28 Re: Amazon Redshift