Re: write scalability

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: write scalability
Date: 2011-07-25 21:19:58
Message-ID: CAHyXU0x6OibNccfxvFEDivh1NXY0XZ3sqLr5Y_ZONfHeRJDD0w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've long harbored a suspicion, based on some testing I did on my home
> machine, that WALInsertLock is a big performance bottleneck.  But I
> just did some benchmarking that doesn't entirely support that
> contention.  This is on Nate Boley's 32-core machine, with the
> following settings:
>
> max_connections = 100
> shared_buffers = 8GB
> synchronous_commit = off
> checkpoint_segments = 100
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
>
> I did 5-minute pgbench runs with unlogged tables and with permanent
> tables, restarting the database server and reinitializing the tables
> between each run.  The number at the beginning of each line is the
> number of clients, while the p/u indicates which type of tables were
> used:
>
> 01p tps = 628.626815 (including connections establishing)
> 01p tps = 635.394288 (including connections establishing)
> 01p tps = 634.972789 (including connections establishing)
> 08p tps = 3342.787325 (including connections establishing)
> 08p tps = 3883.876813 (including connections establishing)
> 08p tps = 3941.253567 (including connections establishing)
> 32p tps = 5597.472192 (including connections establishing)
> 32p tps = 5738.139895 (including connections establishing)
> 32p tps = 5794.490934 (including connections establishing)
> 80p tps = 4499.685286 (including connections establishing)
> 80p tps = 4917.060441 (including connections establishing)
> 80p tps = 5050.931933 (including connections establishing)
>
> 01u tps = 672.469142 (including connections establishing)
> 01u tps = 671.256686 (including connections establishing)
> 01u tps = 670.421003 (including connections establishing)
> 08u tps = 4087.749529 (including connections establishing)
> 08u tps = 3797.750851 (including connections establishing)
> 08u tps = 4181.393560 (including connections establishing)
> 32u tps = 8956.346905 (including connections establishing)
> 32u tps = 8898.442517 (including connections establishing)
> 32u tps = 8971.591569 (including connections establishing)
> 80u tps = 7287.550952 (including connections establishing)
> 80u tps = 7266.816989 (including connections establishing)
> 80u tps = 7255.968109 (including connections establishing)
>
> The speed-up from using unlogged tables was not as large as I
> expected.  Turning off synchronous_commit here removes commit rate as
> the bottleneck, and I think perhaps the main benefit of unlogged
> tables in that case is the avoidance of I/O, and apparently this
> machine has enough I/O bandwidth, and just enough memory in general,
> that that's not an issue.
>
> With either type of tables, the 8 client results are about 6.1 times
> the single core results - not great, but not terrible, either.  With
> 32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x
> is a long way from linear.  vmstat reveals that CPU usage is
> substantially less than 100%.  After some investigation, I found that
> using unlogged tables wasn't actually getting rid of all the
> write-ahead logging - the commit records were still being issued.  So
> I hacked up RecordTransactionCommit() not to emit transaction commit
> records in that case.  That doesn't actually completely eliminate the
> WAL activity, because it still emits records for zeroing new SLRU
> pages for CLOG, but it takes a big chunk out of it.  The upshot is
> that this improved both raw performance and scalability, but not
> dramatically.  Unlogged table results, with this change:
>
> 01h tps = 708.189337 (including connections establishing)
> 01h tps = 704.030175 (including connections establishing)
> 01h tps = 701.644199 (including connections establishing)
> 08h tps = 5196.615955 (including connections establishing)
> 08h tps = 5126.162200 (including connections establishing)
> 08h tps = 5067.568727 (including connections establishing)
> 32h tps = 10661.275636 (including connections establishing)
> 32h tps = 10621.085226 (including connections establishing)
> 32h tps = 10575.267197 (including connections establishing)
> 80h tps = 7557.965666 (including connections establishing)
> 80h tps = 7545.697547 (including connections establishing)
> 80h tps = 7556.379921 (including connections establishing)
>
> Now the 32-client numbers have risen to 15.1x the single-client
> numbers, but that's still not great.
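
For concreteness, here is the kind of change being described, modeled as a
tiny self-contained C program rather than as actual backend code; the type,
field, and function names below are made up for illustration and are not
the real RecordTransactionCommit() internals.

/*
 * Toy model (not PostgreSQL source).  Simplified stock rule: emit a commit
 * WAL record whenever the transaction was assigned an XID.  The hack
 * described above additionally skips the record when the transaction wrote
 * no other WAL, as with transactions that touched only unlogged tables.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct ToyXactState
{
    bool    have_xid;   /* transaction was assigned an XID */
    bool    wrote_wal;  /* transaction emitted any other WAL records */
} ToyXactState;

static bool
emit_commit_record_stock(const ToyXactState *s)
{
    return s->have_xid;                 /* simplified stock rule */
}

static bool
emit_commit_record_hacked(const ToyXactState *s)
{
    return s->have_xid && s->wrote_wal; /* skip WAL-less commits */
}

int
main(void)
{
    ToyXactState unlogged_only = {true, false};

    printf("stock emits commit record:  %d\n",
           emit_commit_record_stock(&unlogged_only));   /* prints 1 */
    printf("hacked emits commit record: %d\n",
           emit_commit_record_hacked(&unlogged_only));  /* prints 0 */
    return 0;
}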
>
> What does this mean?  Well, XLogInsert does strike me as an awfully
> complex piece of code to be running with a hot LWLock held in
> exclusive mode.  But even so, I think this certainly means that
> WALInsertLock, at least on this workload, is not the whole problem...
> in this test, I'm not only eliminating the overhead of inserting the
> WAL, but also the overhead of writing it, flushing it, and generating
> it.   So there is something, other than WAL insertion, which is taking
> a big bite out of performance here.
>
> As to what that something might be, I reran this last test with
> LWLOCK_STATS enabled and here are the top things that are blocking:
>
> lwlock 310: shacq 96846 exacq 108433 blk 16275
> lwlock 3: shacq 64 exacq 2628381 blk 36027
> lwlock 7: shacq 0 exacq 2628615 blk 85220
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
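
For anyone wanting to reproduce the numbers above: LWLOCK_STATS is a
compile-time option, not a GUC, so it has to be defined when the server is
built.  A rough sketch of enabling it (the usual place is
src/include/pg_config_manual.h, though adding -DLWLOCK_STATS to CPPFLAGS
works as well; this is from memory, not from the message above):

/*
 * Define LWLOCK_STATS before building, then rebuild and reinstall.  With it
 * defined, each backend prints per-lock counters (shacq/exacq/blk) to stderr
 * at process exit; those are the figures being aggregated above.
 */
#define LWLOCK_STATS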
>
> During this 5-minute test run, an LWLock acquisition blocked 6180335
> times.  As you can see from the above results, ProcArrayLock accounts
> for 54% of that blocking, and CLogControlLock accounts for another
> 34%.  lwlock 7 is WALInsertLock, which manages to account for more
> than 1% of the blocking despite the fact that WAL has been largely
> eliminated in this test; lwlock 3 is XidGenLock.  I dunno what LWLock
> 310 is but it doesn't matter.  After that there is a long tail of
> other stuff with gradually decreasing amounts of blocking.  Experience
> with the read scalability stuff has taught me also to look at which
> LWLocks have the most shared acquisitions, as that can cause spinlock
> and cache line contention.  The top few culprits are:
>
> lwlock 504: shacq 5315030 exacq 0 blk 0
> lwlock 45: shacq 5967317 exacq 13284 blk 1722
> lwlock 39: shacq 8219988 exacq 13342 blk 2291
> lwlock 5: shacq 26306020 exacq 0 blk 0
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
>
> In all, there were 238777533 shared LWLock acquisitions during this
> test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> to be dealt with, as discussed elsewhere on-list), and then it gets
> down into the lock manager locks and a few others.
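
To decode the numeric IDs above: they are just positions in the LWLockId
enum in src/include/storage/lwlock.h.  From memory, the first entries in
that era look like the following (abridged; verify against your own tree):

typedef enum LWLockId
{
    BufFreelistLock,        /* 0 */
    ShmemIndexLock,         /* 1 */
    OidGenLock,             /* 2 */
    XidGenLock,             /* 3  -- "lwlock 3" above */
    ProcArrayLock,          /* 4  -- "lwlock 4" above */
    SInvalReadLock,         /* 5  -- "lwlock 5" above */
    SInvalWriteLock,        /* 6 */
    WALInsertLock,          /* 7  -- "lwlock 7" above */
    WALWriteLock,           /* 8 */
    ControlFileLock,        /* 9 */
    CheckpointLock,         /* 10 */
    CLogControlLock,        /* 11 -- "lwlock 11" above */

    /*
     * ... more named locks follow, and after those come the dynamically
     * numbered buffer-mapping, lock-manager, and per-buffer locks, which is
     * where IDs like 39, 45, 310, and 504 come from.
     */
} LWLockId;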

hm, all the CLogControlLock acquisitions in clog.c appear to be
exclusive... or did you mean shared in some other sense?
TransactionIdGetStatus is taking an exclusive lock, which is a red flag
and a good optimization target, I think.

merlin
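
If TransactionIdGetStatus (or anything else on that path) really does take
the control lock exclusively just to read a resident page, the classic fix
is to do the lookup under a shared lock and escalate to exclusive only when
the page actually has to be brought in.  A generic, self-contained sketch of
that pattern follows; it is not clog.c, just an illustration using a POSIX
rwlock in place of an LWLock:

/*
 * Try the read under a shared lock; fall back to the exclusive lock only on
 * a cache miss.  Compile with: cc -pthread demo.c
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t control_lock = PTHREAD_RWLOCK_INITIALIZER;
static int  cached_page = -1;      /* which page is resident; -1 = none */
static char page_data[8192];       /* the cached page itself */

/* Pretend to read a page from disk into the cache.  Caller holds the
 * exclusive lock. */
static void
load_page(int pageno)
{
    cached_page = pageno;
    page_data[0] = (char) pageno;  /* stand-in for real I/O */
}

/* Fetch one byte of a page, escalating the lock only on a cache miss. */
static char
read_status(int pageno, int offset)
{
    char    result;

    pthread_rwlock_rdlock(&control_lock);      /* cheap, concurrent path */
    if (cached_page == pageno)
    {
        result = page_data[offset];
        pthread_rwlock_unlock(&control_lock);
        return result;
    }
    pthread_rwlock_unlock(&control_lock);

    pthread_rwlock_wrlock(&control_lock);      /* slow path: load the page */
    if (cached_page != pageno)                 /* re-check after re-acquire */
        load_page(pageno);
    result = page_data[offset];
    pthread_rwlock_unlock(&control_lock);
    return result;
}

int
main(void)
{
    printf("%d\n", read_status(3, 0));  /* first call takes the slow path */
    printf("%d\n", read_status(3, 0));  /* second call stays shared */
    return 0;
}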
