Re: pgbench vs. wait events

From: Alfred Perlstein <alfred(at)freebsd(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgbench vs. wait events
Date: 2016-10-07 03:55:32
Message-ID: daf9117e-d08f-6d22-4381-d273b5a788b3@freebsd.org
Lists: pgsql-hackers

Robert,

This contention on WAL reminds me of a similar scenario I've heard about
elsewhere.

The fix there was to make whichever thread currently holds the lock
responsible for writing out the buffers of everyone blocked behind it.

Then, for example, say you have four threads: A, B, C, and D.

While A is writing to WAL and holding the lock, B arrives and of course
blocks, then C comes along and blocks as well, then D.

Finally A finishes its write, and then ...

Now you have two options for resolving this:

1) A drops the lock; B picks up the lock, writes its buffer, and then drops
the lock. Then C gets the lock, writes its buffer, and drops the lock; and
finally D gets the lock, writes its buffer, and drops the lock.

2) A also writes out B's, C's, and D's buffers before dropping the lock;
B, C, and D then wake up, see that their respective buffers have already
been written, and just return. This greatly speeds up the system. (Just be
careful that A doesn't do "too much work", otherwise you can get a sort of
livelock when too many threads are queued behind it; generally, issue only
one additional flush on behalf of other threads, and do not "loop until the
queue is empty".)

I'm not sure whether this is actually possible with the way WAL is
implemented (or whether this strategy is already in place), but if it isn't
done already it's definitely worth doing, as it can speed things up
enormously. A toy sketch of what I mean by option 2 is below.
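
Here is a minimal standalone sketch of option 2 with the "only one extra
flush" limit. It is plain pthreads with made-up names (wal_lock, pending[],
flushed[], flush_to_disk()), not actual PostgreSQL code, and each thread
commits exactly once; a real implementation would wake waiters with a flag
or condition variable rather than having them briefly take the lock just to
discover their buffer was already written:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4

static pthread_mutex_t wal_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static int pending[NTHREADS];   /* 1 = this thread's buffer is queued */
static int flushed[NTHREADS];   /* 1 = someone already flushed it */

static void
flush_to_disk(int who)
{
    /* stand-in for the real WAL write + fsync */
    printf("flushing buffer of thread %d\n", who);
    usleep(1000);
}

static void *
commit(void *arg)
{
    int me = (int) (long) arg;
    int batch[NTHREADS];
    int n = 0;
    int i;

    /* announce that our buffer needs flushing */
    pthread_mutex_lock(&queue_lock);
    pending[me] = 1;
    pthread_mutex_unlock(&queue_lock);

    pthread_mutex_lock(&wal_lock);

    pthread_mutex_lock(&queue_lock);
    if (flushed[me])
    {
        /* a previous lock holder flushed us while we slept; just return */
        pthread_mutex_unlock(&queue_lock);
        pthread_mutex_unlock(&wal_lock);
        return NULL;
    }

    /*
     * We hold the lock: take a snapshot of everything queued right now
     * (our own buffer included), but only this one batch.  We do not loop
     * until the queue drains, which avoids the livelock mentioned above
     * when new waiters keep arriving.
     */
    for (i = 0; i < NTHREADS; i++)
    {
        if (pending[i] && !flushed[i])
            batch[n++] = i;
    }
    pthread_mutex_unlock(&queue_lock);

    for (i = 0; i < n; i++)
        flush_to_disk(batch[i]);

    /* mark everyone in the batch as done so they can skip their own flush */
    pthread_mutex_lock(&queue_lock);
    for (i = 0; i < n; i++)
        flushed[batch[i]] = 1;
    pthread_mutex_unlock(&queue_lock);

    pthread_mutex_unlock(&wal_lock);
    return NULL;
}

int
main(void)
{
    pthread_t t[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, commit, (void *) i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}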

On 10/6/16 11:38 AM, Robert Haas wrote:
> Hi,
>
> I decided to do some testing on hydra (IBM-provided community
> resource, POWER, 16 cores/64 threads, kernel 3.2.6-3.fc16.ppc64) using
> the newly-enhanced wait event stuff to try to get an idea of what
> we're waiting for during pgbench. I did 30-minute pgbench runs with
> various configurations, but all had max_connections = 200,
> shared_buffers = 8GB, maintenance_work_mem = 4GB, synchronous_commit =
> off, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
> log_line_prefix = '%t [%p] ', max_wal_size = 40GB, log_checkpoints =
> on. During each run, I ran this psql script in another window and
> captured the output:
>
> \t
> select wait_event_type, wait_event from pg_stat_activity where pid !=
> pg_backend_pid()
> \watch 0.5
>
> Then, I used a little shell-scripting to count up the number of times
> each wait event occurred in the output. First, I tried scale factor
> 3000 with 32 clients and got these results:
>
> 1 LWLockTranche | buffer_mapping
> 9 LWLockNamed | CLogControlLock
> 14 LWLockNamed | ProcArrayLock
> 16 Lock | tuple
> 25 LWLockNamed | CheckpointerCommLock
> 49 LWLockNamed | WALBufMappingLock
> 122 LWLockTranche | clog
> 182 Lock | transactionid
> 287 LWLockNamed | XidGenLock
> 1300 Client | ClientRead
> 1375 LWLockTranche | buffer_content
> 3990 Lock | extend
> 21014 LWLockNamed | WALWriteLock
> 28497 |
> 58279 LWLockTranche | wal_insert
>
> tps = 1150.803133 (including connections establishing)
>
> What jumps out here, at least to me, is that there is furious
> contention on both the wal_insert locks and on WALWriteLock.
> Apparently, the system simply can't get WAL on disk fast enough to
> keep up with this workload. Relation extension locks and
> buffer_content locks are also pretty common, both ahead of
> ClientRead, a relatively uncommon wait event on this test. The load
> average on the system was only about 3 during this test, indicating
> that most processes are in fact spending most of their time off-CPU.
> The first thing I tried was switching to unlogged tables, which
> produces these results:
>
> 1 BufferPin | BufferPin
> 1 LWLockTranche | lock_manager
> 2 LWLockTranche | buffer_mapping
> 8 LWLockNamed | ProcArrayLock
> 9 LWLockNamed | CheckpointerCommLock
> 9 LWLockNamed | CLogControlLock
> 11 LWLockTranche | buffer_content
> 37 LWLockTranche | clog
> 153 Lock | tuple
> 388 LWLockNamed | XidGenLock
> 827 Lock | transactionid
> 1267 Client | ClientRead
> 20631 Lock | extend
> 91767 |
>
> tps = 1223.239416 (including connections establishing)
>
> If you don't look at the TPS number, these results look like a vast
> improvement. The overall amount of time spent not waiting for
> anything is now much higher, and the problematic locks have largely
> disappeared from the picture. However, the load average now shoots up
> to about 30, because most of the time that the backends are "not
> waiting for anything" they are in fact in kernel wait state D; that
> is, they're stuck doing I/O. This suggests that we might want to
> consider advertising a wait state when a backend is doing I/O, so we
> can measure this sort of thing.
>
> Next, I tried lowering the scale factor to something that fits in
> shared buffers. Here are the results at scale factor 300:
>
> 14 Lock | tuple
> 22 LWLockTranche | lock_manager
> 39 LWLockNamed | WALBufMappingLock
> 331 LWLockNamed | CLogControlLock
> 461 LWLockNamed | ProcArrayLock
> 536 Lock | transactionid
> 552 Lock | extend
> 716 LWLockTranche | buffer_content
> 763 LWLockNamed | XidGenLock
> 2113 LWLockNamed | WALWriteLock
> 6190 LWLockTranche | wal_insert
> 25002 Client | ClientRead
> 78466 |
>
> tps = 27651.562835 (including connections establishing)
>
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work. ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare. Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples. It's not
> clear how much of the WAL-related stuff is artificial lock contention
> and how much of it is the limited speed at which the disk can rotate.
>
> However, I was curious about what's going on with CLogControlLock,
> pursuant to the other thread on that topic, so I reran this with
> unlogged tables. At scale factor 300, 32 clients, unlogged tables, I
> get these results:
>
> 6 LWLockTranche | wal_insert
> 12 Lock | tuple
> 26 LWLockTranche | lock_manager
> 39 LWLockTranche | buffer_content
> 353 LWLockNamed | CLogControlLock
> 377 Lock | extend
> 491 Lock | transactionid
> 500 LWLockNamed | ProcArrayLock
> 696 LWLockNamed | XidGenLock
> 27685 Client | ClientRead
> 84930 |
>
> WAL contention is eliminated, and CLog contention doesn't increase.
> Next, I tried ramping up the client count to see what effect that had.
> These next three results are all at scale factor 300 with unlogged
> tables.
>
> 64 clients:
>
> 1 BufferPin | BufferPin
> 2 LWLockTranche | buffer_mapping
> 51 LWLockTranche | wal_insert
> 52 Lock | relation
> 132 LWLockTranche | lock_manager
> 154 LWLockTranche | buffer_content
> 209 Lock | tuple
> 940 Lock | extend
> 1412 LWLockNamed | XidGenLock
> 2337 LWLockNamed | CLogControlLock
> 2813 LWLockNamed | ProcArrayLock
> 3859 Lock | transactionid
> 83760 Client | ClientRead
> 133840 |
>
> tps = 36672.484524 (including connections establishing)
>
> 96 clients:
>
> 1 LWLockTranche | proc
> 53 LWLockTranche | wal_insert
> 305 LWLockTranche | buffer_content
> 391 LWLockTranche | lock_manager
> 797 Lock | tuple
> 1074 Lock | extend
> 2492 LWLockNamed | XidGenLock
> 4185 LWLockNamed | ProcArrayLock
> 6003 LWLockNamed | CLogControlLock
> 11348 Lock | transactionid
> 142881 Client | ClientRead
> 174011 |
>
> tps = 37214.257661 (including connections establishing)
>
> 192 clients:
>
> 1 LWLockTranche | buffer_mapping
> 2 BufferPin | BufferPin
> 4 LWLockTranche | clog
> 8 LWLockTranche | proc
> 87 LWLockTranche | wal_insert
> 708 LWLockTranche | buffer_content
> 1892 LWLockTranche | lock_manager
> 2225 Lock | extend
> 4796 LWLockNamed | ProcArrayLock
> 5105 LWLockNamed | XidGenLock
> 7982 Lock | tuple
> 11882 LWLockNamed | CLogControlLock
> 64091 Lock | transactionid
> 275481 |
> 308627 Client | ClientRead
>
> tps = 35018.906863 (including connections establishing)
>
> There are a couple of interesting things here. First, CLogControlLock
> contention becomes more significant as the number of clients goes
> up. At 32 clients, it is the 6th-most frequent wait event; at 192
> clients, it's moved up to 4th, and it's clearly growing at a faster
> rate than some of the others. As we go from 32 to 64 to 96 to 192
> clients, ProcArrayLock is captured in 500, 2813, 4185, 4796 samples,
> but CLogControlLock is captured in 353, 2337, 6003, 11882 samples, so
> it goes from being only 2/3 of ProcArrayLock to being more than twice
> ProcArrayLock. However, in absolute terms, it's still not very
> significant. Even at 192 clients, where it's the fourth-most-common
> wait event, it appears in only 1.7% of samples, which means backends
> are only waiting for it about 1.7% of the time. It appears that, on
> this system, this workload is NEVER going to get meaningfully faster
> by improving CLogControlLock.
>
> Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources. Third,
> Lock/transactionid and Lock/tuple become more and more common as the
> number of clients increases; these happen when two different pgbench
> threads decide to hit the same row at the same time. Due to the
> birthday paradox this increases pretty quickly as the number of
> clients ramps up, but it's not really a server issue: there's nothing
> the server can do about the possibility that two or more clients pick
> the same random number at the same time.
>
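
On the Lock/transactionid and Lock/tuple point at the end: with the stock
TPC-B-ish pgbench script, every transaction updates one of the scale-many
pgbench_branches rows, so at scale factor 300 the chance that at least two
of c in-flight transactions picked the same branch row is roughly
1 - exp(-c*(c-1)/600). That works out to about 0.8 already at 32 clients
and essentially 1 at 192, which is consistent with how quickly those waits
ramp up in the numbers above.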
