From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: write scalability
Date: 2011-07-25 21:19:58
Message-ID: CAHyXU0x6OibNccfxvFEDivh1NXY0XZ3sqLr5Y_ZONfHeRJDD0w@mail.gmail.com
Lists: pgsql-hackers
On Mon, Jul 25, 2011 at 3:07 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I've long harbored a suspicion, based on some testing I did on my home
> machine, that WALInsertLock is a big performance bottleneck. But I
> just did some benchmarking that doesn't entirely support that
> contention. This is on Nate Boley's 32-core machine, with the
> following settings:
>
> max_connections = 100
> shared_buffers = 8GB
> synchronous_commit = off
> checkpoint_segments = 100
> checkpoint_timeout = 15min
> checkpoint_completion_target = 0.9
>
> I did 5-minute pgbench runs with unlogged tables and with permanent
> tables, restarting the database server and reinitializing the tables
> between each run. The number at the beginning of each line is the
> number of clients, while the p/u indicates which type of tables were
> used:
>
> 01p tps = 628.626815 (including connections establishing)
> 01p tps = 635.394288 (including connections establishing)
> 01p tps = 634.972789 (including connections establishing)
> 08p tps = 3342.787325 (including connections establishing)
> 08p tps = 3883.876813 (including connections establishing)
> 08p tps = 3941.253567 (including connections establishing)
> 32p tps = 5597.472192 (including connections establishing)
> 32p tps = 5738.139895 (including connections establishing)
> 32p tps = 5794.490934 (including connections establishing)
> 80p tps = 4499.685286 (including connections establishing)
> 80p tps = 4917.060441 (including connections establishing)
> 80p tps = 5050.931933 (including connections establishing)
>
> 01u tps = 672.469142 (including connections establishing)
> 01u tps = 671.256686 (including connections establishing)
> 01u tps = 670.421003 (including connections establishing)
> 08u tps = 4087.749529 (including connections establishing)
> 08u tps = 3797.750851 (including connections establishing)
> 08u tps = 4181.393560 (including connections establishing)
> 32u tps = 8956.346905 (including connections establishing)
> 32u tps = 8898.442517 (including connections establishing)
> 32u tps = 8971.591569 (including connections establishing)
> 80u tps = 7287.550952 (including connections establishing)
> 80u tps = 7266.816989 (including connections establishing)
> 80u tps = 7255.968109 (including connections establishing)
>
> The speed-up from using unlogged tables was not as large as I
> expected. Turning off synchronous_commit here removes commit rate as
> the bottleneck, and I think perhaps the main benefit of unlogged
> tables in that case is the avoidance of I/O, and apparently this
> machine has enough I/O bandwidth, and just enough memory in general,
> that that's not an issue.
>
> With either type of tables, the 8 client results are about 6.1 times
> the single core results - not great, but not terrible, either. With
> 32 clients, there is some improvement: 13.4x vs. 9.1x, but even 13.4x
> is a long way from linear. vmstat reveals that CPU usage is
> substantially less than 100%. After some investigation, I found that
> using unlogged tables wasn't actually getting rid of all the
> write-ahead logging - the commit records were still being issued. So
> I hacked up RecordTransactionCommit() not to emit transaction commit
> records in that case. That doesn't actually completely eliminate the
> WAL activity, because it still emits records for zeroing new SLRU
> pages for CLOG, but it takes a big chunk out of it. The upshot is
> that this improved both raw performance and scalability, but not
> dramatically. Unlogged table results, with this change:
>
> 01h tps = 708.189337 (including connections establishing)
> 01h tps = 704.030175 (including connections establishing)
> 01h tps = 701.644199 (including connections establishing)
> 08h tps = 5196.615955 (including connections establishing)
> 08h tps = 5126.162200 (including connections establishing)
> 08h tps = 5067.568727 (including connections establishing)
> 32h tps = 10661.275636 (including connections establishing)
> 32h tps = 10621.085226 (including connections establishing)
> 32h tps = 10575.267197 (including connections establishing)
> 80h tps = 7557.965666 (including connections establishing)
> 80h tps = 7545.697547 (including connections establishing)
> 80h tps = 7556.379921 (including connections establishing)
>
> Now the 32-client numbers have risen to 15.1x the single-client
> numbers, but that's still not great.
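
Those scaling ratios are easy to sanity-check from the tps figures quoted
above; a quick sketch, averaging the three runs per configuration (which may
differ slightly from however the 13.4x/15.1x in the prose were computed):

```python
# Mean tps per configuration, from the runs quoted above.
# "u" = unlogged tables, "h" = unlogged plus the commit-record hack.
runs = {
    "01u": [672.469142, 671.256686, 670.421003],
    "32u": [8956.346905, 8898.442517, 8971.591569],
    "01h": [708.189337, 704.030175, 701.644199],
    "32h": [10661.275636, 10621.085226, 10575.267197],
}
mean = {k: sum(v) / len(v) for k, v in runs.items()}

# Scale-up of 32 clients over 1 client for each configuration.
scale_u = mean["32u"] / mean["01u"]
scale_h = mean["32h"] / mean["01h"]
print(f"unlogged: {scale_u:.1f}x, hacked: {scale_h:.1f}x")
```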
>
> What does this mean? Well, XLogInsert strikes me as an awfully
> complex piece of code to be running with a hot LWLock held in
> exclusive mode. But even so, I think this certainly means that
> WALInsertLock, at least on this workload, is not the whole problem...
> in this test, I'm not only eliminating the overhead of inserting the
> WAL, but also the overhead of writing it, flushing it, and generating
> it. So there is something, other than WAL insertion, which is taking
> a big bite out of performance here.
>
> As to what that something might be, I reran this last test with
> LWLOCK_STATS enabled and here are the top things that are blocking:
>
> lwlock 310: shacq 96846 exacq 108433 blk 16275
> lwlock 3: shacq 64 exacq 2628381 blk 36027
> lwlock 7: shacq 0 exacq 2628615 blk 85220
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
>
> During this 5-minute test run, an LWLock acquisition blocked 6180335
> times. As you can see from the above results, ProcArrayLock accounts
> for 54% of that blocking, and CLogControlLock accounts for another
> 34%. lwlock 7 is WALInsertLock, which manages to account for more
> than 1% of the blocking despite the fact that WAL has been largely
> eliminated in this test; lwlock 3 is XidGenLock. I dunno what LWLock
> 310 is but it doesn't matter. After that there is a long tail of
> other stuff with gradually decreasing amounts of blocking. Experience
> with the read scalability stuff has taught me also to look at which
> LWLocks have the most shared acquisitions, as that can cause spinlock
> and cache line contention. The top few culprits are:
>
> lwlock 504: shacq 5315030 exacq 0 blk 0
> lwlock 45: shacq 5967317 exacq 13284 blk 1722
> lwlock 39: shacq 8219988 exacq 13342 blk 2291
> lwlock 5: shacq 26306020 exacq 0 blk 0
> lwlock 4: shacq 28667307 exacq 2628524 blk 3356651
> lwlock 11: shacq 84913908 exacq 4539551 blk 2119423
>
> In all, there were 238777533 shared LWLock acquisitions during this
> test: 35% CLogControlLock, 12% ProcArrayLock, 11% SInvalReadLock (soon
> to be dealt with, as discussed elsewhere on-list), and then it gets
> down into the lock manager locks and a few others.
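Those percentages fall straight out of the LWLOCK_STATS lines above; here is a
quick check (the lock-number-to-name mapping — 4 = ProcArrayLock,
11 = CLogControlLock, 7 = WALInsertLock — is taken from the prose, and rounding
may differ by a point from the figures quoted there):

```python
# blk and shacq counts copied from the LWLOCK_STATS output above.
blk = {310: 16275, 3: 36027, 7: 85220, 11: 2119423, 4: 3356651}
total_blocked = 6180335    # total LWLock acquisitions that blocked in the run
total_shared = 238777533   # total shared LWLock acquisitions in the run

print(f"ProcArrayLock blocking:   {blk[4] / total_blocked:.0%}")   # ~54%
print(f"CLogControlLock blocking: {blk[11] / total_blocked:.0%}")  # ~34%
print(f"WALInsertLock blocking:   {blk[7] / total_blocked:.1%}")   # just over 1%

shacq = {11: 84913908, 4: 28667307, 5: 26306020}
print(f"CLogControlLock shared:   {shacq[11] / total_shared:.0%}") # ~36%
```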
Hm, all the CLogControlLock acquisitions in clog.c appear to be
exclusive... or did you mean shared in some other sense?
TransactionIdGetStatus is taking an exclusive lock, which is a red flag
and a good optimization target, I think.
merlin