Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Scaling XLog insertion (was Re: Moving more work outside WALInsertLock)
Date: 2012-03-06 14:20:11
Message-ID: CAHGQGwGTJx8yN2ZNygzTqwHKQZJHJ5WG_gKQi1UB7HJB68=2CQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Mar 6, 2012 at 1:50 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> +        * An xlog-switch record consumes all the remaining space on the
>> +        * WAL segment. We have already reserved it for us, but we still
>> need
>> +        * to make sure it's been allocated and zeroed in the WAL buffers
>> so
>> +        * that when the caller (or someone else) does XLogWrite(), it can
>> +        * really write out all the zeros.
>>
>> Why do we need to write out all the remaining space with zeros? In
>> current master, we don't do that. A recovery code ignores the data
>> following XLOG_SWITCH record, so I don't think that's required.
>
>
> It's to keep the logic simpler. Before the patch, an xlog-switch just
> initialized the next page in the WAL buffers to insert to, to be the first
> page in the next segment. With this patch, we rely on a simple linear
> mapping from an XLogRecPtr to the WAL buffer that should contain that page
> (see XLogRecPtrToBufIdx()). Such a mapping is not possible if you sometimes
> skip over pages in the WAL buffers, so we allocate the buffers for those
> empty pages, too. Note that this means that an xlog-switch can blow through
> all your WAL buffers.
>
> We could probably optimize that so that you don't need to actually write()
> and fsync() all the zeros, perhaps by setting a flag on the WAL buffer to
> indicate that it only contains padding for an xlog-switch. However, I don't
> see any easy way to avoid blowing the cache.
>
> I'm thinking that xlog-switching happens so seldom, and typically on a
> fairly idle system, so we don't need to optimize it much. I guess we should
> measure the impact, though..

Online backup which forces an xlog-switch twice might be performed under
a certain amount of load. So, to avoid the performance spike during online
backup, I think that the optimization which skips write() and fsync() of
the padding bytes is helpful.

> On 22.02.2012 03:34, Fujii Masao wrote:
>> When I ran the long-running performance test, I encountered the following
>> panic error.
>>
>>      PANIC:  could not find WAL buffer for 0/FF000000
>>
>> 0/FF000000 is the xlog file boundary, so the patch seems to handle
>> the xlog file boundary incorrectly. In the patch, current insertion lsn
>> is advanced by directly incrementing XLogRecPtr.xrecoff as follows.
>> But to handle the xlog file boundary correctly, we should use
>> XLByteAdvance() for that, instead?
>
> Thanks, fixed this, too.
>
> I made the locking a bit more strict in WaitXLogInsertionsToFinish(), so
> that it grabs the insertpos_lck to read nextslot. I previously thought that
> was not necessary, assuming that reading/writing an int32 is atomic, but I'm
> afraid there might be memory-ordering issues where the CurrPos of the most
> recent slot has not become visible to other backends yet, while the
> advancing of nextslot has.
>
> That particular issue would be very hard to hit in practice, so I don't know
> if this could explain the recovery failures that Jeff saw. I got the test
> script running (thanks for that Jeff!), but unfortunately have not seen any
> failures yet (aside from the issue with crossing xlogid boundary), with
> either this or the older version of the patch.
>
> Attached is a new version of the patch.

Thanks for the new patch!

In this new patch, I again was able to reproduce the assertion failure which
I described on the upthread.
http://archives.postgresql.org/message-id/CAHGQGwGRuNJ%3D_ctXwteNkFkdvMDNFYxFdn0D1cd-CqL0OgNCLg%40mail.gmail.com

$ uname -a
Linux hermes 3.0.0-16-generic #28-Ubuntu SMP Fri Jan 27 17:50:54 UTC
2012 i686 i686 i386 GNU/Linux

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-03-06 14:25:17 Re: Checksums, state of play
Previous Message Kohei KaiGai 2012-03-06 14:14:54 Re: [v9.2] Add GUC sepgsql.client_label