Re: WAL format and API changes (9.5)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-10-30 17:52:47
Message-ID: 54527AEF.20506@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/30/2014 06:02 PM, Andres Freund wrote:
> On 2014-10-29 10:24:20 +0200, Heikki Linnakangas wrote:
>> On 10/06/2014 02:29 PM, Andres Freund wrote:
>>> I've not yet really looked,
>>> but on a quick readthrough XLogInsertRecData() staying in xlog.c doesn't
>>> make me happy...
>>
>> Ok.. Can you elaborate?
>
> To me the split between xloginsert.c doing some of the record assembly,
> and xlog.c doing the lower level part of the assembly is just wrong.

Really? To me, that feels like a natural split. xloginsert.c is
responsible for constructing the final WAL record, with the backup
blocks and all. It doesn't access any shared memory (related to WAL; it
does look at Buffers, to determine what to back up). The function in
xlog.c inserts the assembled record to the WAL, as is. It handles the
locking and WAL buffer management related to that.

What would you suggest? I don't think putting everything in one
XLogInsert function, like we have today, is better. Note that the second
patch makes xloginsert.c a lot more complicated.

> And it leads to things like the already complicated logic to retry after
> detecting missing fpws is now split across two files seems to confirm
> that. What happens right now is that XLogInsert() (with some helper
> routines) assembles the record. Then hands that off to
> XLogInsertRecData(). Which sometimes returns InvalidXLogRecPtr, and
> returns back to XLogInsert(), which reverses *some* of its work and then
> retries. Brr.

Modifying the chain that was already constructed is not pretty, but it's
not this patch's fault. I don't think splitting the functions makes that
any worse. (BTW, the follow-up patch to change the WAL format fixes
that; the per-buffer data is kept separately from the main data chain).

> Similary the 'page_writes_omitted' logic doesn't make me particularly
> happy. Previously we retried when there actually was a page affected by
> the different RedoRecPtr. Now we do it as soon as our copy of RedoRecPtr
> is out of date? Won't that quite often spuriously trigger retries? Am I
> missing something?
> Arguably this doesn't happen often enough to matter, but it's still
> something that we should explicitly discuss.

The situation where it leads to a spurious retry is when the constructed
WAL record skipped the FPW of a buffer, and RedoRecPtr was updated, but
the buffer stilll doesn't need to be FPW according to the updated
RedoRecPtr. That only happens if the same buffer was updated (by another
backend) after the new checkpoint began. I believe that's extremely rare.

We could make that more fine-grained, though. Instead of passing a
boolean 'fpw_sensitive' flag to XLogInsertRecData, pass the lowest LSN
among the pages whose FPW was skipped. If that's less than the new
RedoRecPtr, then retry is needed. That would eliminate the spurious retries.

> The implementation of the split seems to change the meaning of
> TRACE_POSTGRESQL_XLOG_INSERT - it's now counting retries. I don't
> particularly care about those tracepoints, but I see no simplification
> due to changing its meaning.

I wasn't sure what to do about that. I don't use tracepoints, and I
don't know what others use them for.

> I'm not a big fan of the naming for the new split. We have
> * XLogInsert() which is the external interface
> * XLogRecordAssemble() which builds the rdata chain that will be
> inserted
> * XLogInsertRecData() which actually inserts the data into the xl buffers.
>
> To me that's pretty inconsistent.

Got a better suggestion? I wanted to keep the name XLogInsert()
unchanged, to avoid code churn, and because it's still a good name for
that. XLogRecordAssemble is pretty descriptive for what it does,
although "XLogRecord" implies that it construct an XLogRecord struct. It
does fill in that too, but the main purpose is to build the XLogRecData
chain. Perhaps XLogAssembleRecord() would be better.

I'm not very happy with XLogInsertRecData myself. XLogInsertRecord?

> I'm also somewhat confused about the new split of the headers. I'm not
> adverse to splitting them, but it's getting a bit hard to understand:
> * xlog.h - generic mishmash
> * xlogdefs.h - Three typedefs and some #defines
> * xlog_fn.h - SQL functions
> * xlog_internal.h - Pretty random collection of stuff
> * xlogutils.h - "Utilities for replaying WAL records"
> * xlogreader.h - "generic XLog reading facility"
> * xloginsert.h - "Functions for generating WAL records"
> * xlogrecord.h - "Definitions for the WAL record format."
>
> Isn't that a bit too much? Pretty much the only files that have a
> recognizable separation to me are xlog_fn.h and xlogreader.h.

This is a matter of taste, I guess. I find xlog.h, xlogdefs.h and
xlog_internal.h fairly random, while the rest are have a clear function.
xlogutils.h is needed by redo functions, and xloginsert.h is needed for
generating WAL.

Note that the second patch adds more stuff to xloginsert.h and xlogrecord.h.

> You've removed the
> - *
> - * NB: this routine feels free to scribble on the XLogRecData structs,
> - * though not on the data they reference. This is OK since the
> XLogRecData
> - * structs are always just temporaries in the calling code.
> comment, but we still do, no?

True. XLogInsertRecData doesn't scribble on its input, but XLogInsert
still does. (this is moot after the second patch though)

> While not this patches blame, I see currently we finish the critical
> section in XLogInsert/XLogInsertRecData pretty early:
> END_CRIT_SECTION();
>
> /*
> * Update shared LogwrtRqst.Write, if we crossed page boundary.
> */
> if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
> {
> SpinLockAcquire(&XLogCtl->info_lck);
> /* advance global request to include new block(s) */
> if (XLogCtl->LogwrtRqst.Write < EndPos)
> XLogCtl->LogwrtRqst.Write = EndPos;
> /* update local result copy while I have the chance */
> LogwrtResult = XLogCtl->LogwrtResult;
> SpinLockRelease(&XLogCtl->info_lck);
> }
> I don't think this is particularly critical, but it still looks wrong to me.

Updating LogwrtRqst.Write is not critical. If it's not done for some
reason, everything still works. Keeping the critical section as small as
possible is a good idea, no?

> Hm. I think that's what I see for now.

Thanks!

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-10-30 17:56:22 Re: BRIN indexes - TRAP: BadArgument
Previous Message Robert Haas 2014-10-30 17:46:59 infinite loop in _bt_getstackbuf