Re: wal_buffers, redux

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: wal_buffers, redux
Date: 2012-03-14 19:29:37
Message-ID: CAMkU=1wYpjqYA-phwk-WtkY8gfKSuz+jCTST9W9QmG32_MvZYg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 12, 2012 at 7:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Mar 11, 2012 at 11:51 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Sun, Mar 11, 2012 at 12:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> I've finally been able to run some more tests of the effect of
>>> adjusting wal_buffers to values higher than 16MB.  I ran the test on
>>> the 16 core (x 4 hw threads/core) IBM POWER7 machine, with my usual
>>> configuration settings:
>>>
>>> shared_buffers = 8GB
>>> maintenance_work_mem = 1GB
>>> synchronous_commit = off
>>> checkpoint_segments = 300
>>> checkpoint_timeout = 15min
>>> checkpoint_completion_target = 0.9
>>> wal_writer_delay = 20ms
>>>
>>> I ran three 30-minute tests at scale factor 300 with wal_buffers set
>>> at various values from 16MB up to 160MB, in multiples of 16MB, using
>>> pgbench with 32 clients and 32 threads in each case.  The short
>>> version is that 32MB seems to be significantly better than 16MB, by
>>> about 1000 tps, and after that it gets murky; full results are below.
>>
>> Currently the max of wal_buffers is 16MB (i.e., the size of one WAL file)
>> when it's set to -1. Given your results, should we increase the max to
>> 32MB?
>
> I think that might be a good idea, although I'm not entirely convinced
> that we understand why increasing wal_buffers is helping as much as it
> is.  I stuck an elog() into AdvanceXLInsertBuffer() to complain in the
> case that we were writing buffers while holding the insert lock.
> Then, I reran 30-minute tests with 32 clients, one with wal_buffers=16MB
> and the other wal_buffers=32MB.  On the 16MB test, the elog() fired 15
> times in a single second shortly after the start of the test, and then
> 9 more times over the rest of the test.  On the 32MB test, the elog()
> fired a total of 6 times over the course of the test.  The first test got
> 14320 tps, while the second got 15026 tps.  I find that quite
> surprising, because although WAL buffer wraparound is certainly bad
> (on this test, it probably brings the system completely to a halt
> until fsync() finishes) it really shouldn't lock up the system for
> multiple seconds at a time.  And yet that's what it would need to be
> doing to account for the tps discrepancy on this test, considering how
> rarely it occurs.

Where did you put the elog()? If you put it near
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START(), it is probably too
late in the code.
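
To make the distinction concrete, here is a toy, self-contained C
model of the control flow in question (the variables and lock calls
are hypothetical stand-ins, not the real xlog.c code):

#include <stdio.h>

static long LogwrtResult_Write = 0;  /* last byte known to be written */
static long OldPageRqstPtr = 100;    /* end of the page we must recycle */

static void LWLockAcquire(const char *name) { printf("acquire %s\n", name); }
static void LWLockRelease(const char *name) { printf("release %s\n", name); }
static void XLogWrite(long upto)     { LogwrtResult_Write = upto; }

int main(void)
{
    if (LogwrtResult_Write < OldPageRqstPtr)
    {
        /* An elog() here would count every stall on the lock. */
        LWLockAcquire("WALWriteLock");

        /* Recheck: someone may have written the page while we waited. */
        if (LogwrtResult_Write < OldPageRqstPtr)
        {
            /* An elog() here, next to the write-dirty tracepoint, fires
             * only when we do the write ourselves, and so misses every
             * stall spent behind someone else's write/fsync. */
            XLogWrite(OldPageRqstPtr);
        }
        LWLockRelease("WALWriteLock");
    }
    return 0;
}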

If someone else (like the background wal writer) is blocked on an
fsync, then AdvanceXLInsertBuffer will block trying to get the
WALWriteLock. Once it gets the lock, it will likely find it no longer
needs to do the write (because the write was already done by the
process that had been holding the lock), but at that point the damage
has already been done. The damage is not that it has to do a write,
but that it had to block (indirectly) on an fsync which it didn't
really care about.
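
Here is a toy pthreads model of that stall (suggestive names only,
nothing from the real code; compile with -lpthread). The backend only
needs to learn that the write has happened, but it must queue behind
the walwriter's fsync to find out:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t WALWriteLock = PTHREAD_MUTEX_INITIALIZER;
static long Write = 0;   /* bytes written; the fsync may still be pending */

static void *walwriter(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&WALWriteLock);
    Write = 100;         /* the write() itself is fast */
    sleep(2);            /* stand-in for a slow fsync() */
    pthread_mutex_unlock(&WALWriteLock);
    return NULL;
}

static void *backend(void *arg)
{
    (void) arg;
    sleep(1);                          /* arrives mid-fsync */
    pthread_mutex_lock(&WALWriteLock); /* blocks until the fsync is done */
    printf("backend: Write=%ld, nothing to do, yet we waited\n", Write);
    pthread_mutex_unlock(&WALWriteLock);
    return NULL;
}

int main(void)
{
    pthread_t w, b;
    pthread_create(&w, NULL, walwriter, NULL);
    pthread_create(&b, NULL, backend, NULL);
    pthread_join(w, NULL);
    pthread_join(b, NULL);
    return 0;
}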

One possibility is that the writer should update
xlogctl->LogwrtResult.Write between finishing the write and starting
the fsync. That way an AdvanceXLInsertBuffer that arrives during the
fsync could see that the data it needs to overwrite has already been
written, even if not yet fsynced, and that would be enough to allow it
to proceed.
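
In the same toy model, the idea looks like this (again hypothetical
names; the real change would update xlogctl under its spinlock): the
writer publishes its write pointer before starting the fsync, so a
backend arriving mid-fsync can see the pages are written without
waiting on the lock. Needs C11 atomics; compile with -lpthread:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t WALWriteLock = PTHREAD_MUTEX_INITIALIZER;
static atomic_long Write = 0;   /* write progress, published early */

static void *walwriter(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&WALWriteLock);
    atomic_store(&Write, 100);  /* publish after write(), before fsync */
    sleep(2);                   /* stand-in for fsync() */
    pthread_mutex_unlock(&WALWriteLock);
    return NULL;
}

static void *backend(void *arg)
{
    (void) arg;
    sleep(1);                   /* arrives mid-fsync */
    if (atomic_load(&Write) >= 100)
        printf("backend: page already written, no need to wait\n");
    return NULL;
}

int main(void)
{
    pthread_t w, b;
    pthread_create(&w, NULL, walwriter, NULL);
    pthread_create(&b, NULL, backend, NULL);
    pthread_join(w, NULL);
    pthread_join(b, NULL);
    return 0;
}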

That would be ugly, because an AdvanceXLInsertBuffer that arrives
during the write itself would check xlogctl and then block until the
fsync finished, with no way to wake up as soon as the write part
finished. So making it work cleanly would require a redesign of the
whole locking scheme, which the scaling patch is already doing.

>
> Studying AdvanceXLInsertBuffer() a bit more, I'm wondering if the
> problem isn't so much - or isn't only - that it's expensive to write
> buffers while also holding WALInsertLock.  Maybe it's too expensive
> even to acquire WALWriteLock in the first place - that is, the real
> problem isn't so much the wraparound condition itself, but the expense
> of testing whether a possible wraparound has actually occurred.  A
> quick test suggests that we acquire WALWriteLock here much more often
> than we actually write anything while holding it, and that we get a
> big burst of WALWriteLock acquisitions here immediately after a
> checkpoint.  I don't have any proof that this is what's causing the
> tps drop with smaller wal_buffers, but I think there has to be
> something other than an actual wraparound condition causing problems
> here, because that just doesn't seem to happen frequently enough to be
> an issue.

I think my analysis is pretty much a re-wording of yours, but I'd
emphasize that taking WALWriteLock is bad not just because backends
fight over the lock, but because someone else (probably the background
wal writer) is camping out on it while doing an fsync.

Cheers,

Jeff
