Skip site navigation (1) Skip section navigation (2)

Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Date: 2012-03-05 16:50:02
Message-ID: 4F54EEBA.5070504@enterprisedb.com (view raw or flat)
Thread:
Lists: pgsql-hackers
On 21.02.2012 13:19, Fujii Masao wrote:
> In some places, the spinlock "insertpos_lck" is taken while another
> spinlock "info_lck" is being held. Is this OK? What if unfortunately
> inner spinlock takes long to be taken?

Hmm, that's only done at a checkpoint (and a restartpoint), so I doubt 
that's a big issue in practice. We had the same pattern before the 
patch, just with WALInsertLock instead of insertpos_lck. Holding a 
spinlock longer is much worse than holding a lwlock longer, but 
nevertheless I don't think that's a problem.

If it does become a problem, we could provide a second copy of 
RedoRecPtr that could be read while holding info_lck, and would be 
allowed to lag behind slightly, while requiring insertpos_lck to 
read/update the main shared copy of RedoRecPtr. The only place that 
reads RedoRecPtr while holding info_lck is GetRedoRecPtr(), which would 
be happy with a value that lags behind a few milliseconds. We could 
still update that copy right after releasing insertpos_lck, so the delay 
between the two would be tiny.

> +	 * An xlog-switch record consumes all the remaining space on the
> +	 * WAL segment. We have already reserved it for us, but we still need
> +	 * to make sure it's been allocated and zeroed in the WAL buffers so
> +	 * that when the caller (or someone else) does XLogWrite(), it can
> +	 * really write out all the zeros.
>
> Why do we need to write out all the remaining space with zeros? In
> current master, we don't do that. A recovery code ignores the data
> following XLOG_SWITCH record, so I don't think that's required.

It's to keep the logic simpler. Before the patch, an xlog-switch just 
initialized the next page in the WAL buffers to insert to, to be the 
first page in the next segment. With this patch, we rely on a simple 
linear mapping from an XLogRecPtr to the WAL buffer that should contain 
that page (see XLogRecPtrToBufIdx()). Such a mapping is not possible if 
you sometimes skip over pages in the WAL buffers, so we allocate the 
buffers for those empty pages, too. Note that this means that an 
xlog-switch can blow through all your WAL buffers.

We could probably optimize that so that you don't need to actually 
write() and fsync() all the zeros, perhaps by setting a flag on the WAL 
buffer to indicate that it only contains padding for an xlog-switch. 
However, I don't see any easy way to avoid blowing the cache.

I'm thinking that xlog-switching happens so seldom, and typically on a 
fairly idle system, so we don't need to optimize it much. I guess we 
should measure the impact, though..

> +	/* XXX: before this patch, TRACE_POSTGRESQL_XLOG_SWITCH was not called
> +	 * if the xlog switch had no work to do, ie. if we were already at the
> +	 * beginning of a new XLOG segment. You can check if RecPtr points to
> +	 * beginning of a segment if you want to keep the distinction.
> +	 */
> +	TRACE_POSTGRESQL_XLOG_SWITCH();
>
> If so, RecPtr (or the flag indicating whether the xlog switch has no
> work to do) should
> be given to TRACE_POSTGRESQL_XLOG_SWITCH() as an argument?

I think I'll just move that call, so that the current behavior is retained.

> The followings are trivial comments:

Thanks, fixed these!

On 22.02.2012 03:34, Fujii Masao wrote:
 > When I ran the long-running performance test, I encountered the following
 > panic error.
 >
 >      PANIC:  could not find WAL buffer for 0/FF000000
 >
 > 0/FF000000 is the xlog file boundary, so the patch seems to handle
 > the xlog file boundary incorrectly. In the patch, current insertion lsn
 > is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
 > But to handle the xlog file boundary correctly, we should use
 > XLByteAdvance() for that, instead?

Thanks, fixed this, too.

I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so 
that it grabs the insertpos_lck to read nextslot. I previously thought 
that was not necessary, assuming that reading/writing an int32 is 
atomic, but I'm afraid there might be memory-ordering issues where the 
CurrPos of the most recent slot has not become visible to other backends 
yet, while the advancing of nextslot has.

That particular issue would be very hard to hit in practice, so I don't 
know if this could explain the recovery failures that Jeff saw. I got 
the test script running (thanks for that Jeff!), but unfortunately have 
not seen any failures yet (aside from the issue with crossing xlogid 
boundary), with either this or the older version of the patch.

Attached is a new version of the patch.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment: xloginsert-scale-10.patch
Description: text/x-diff (95.4 KB)

In response to

Responses

pgsql-hackers by date

Next:From: Simon RiggsDate: 2012-03-05 16:55:57
Subject: Re: archive_keepalive_command
Previous:From: Simon RiggsDate: 2012-03-05 16:46:38
Subject: Re: RFC: Making TRUNCATE more "MVCC-safe"

Privacy Policy | About PostgreSQL
Copyright © 1996-2013 The PostgreSQL Global Development Group