Quick Links

Re: Speed up Clog Access by increasing CLOG buffers

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Speed up Clog Access by increasing CLOG buffers
Date:	2016-10-13 02:23:24
Message-ID:	4aafed4c-8002-9a9c-0d15-73d19ab7e317@2ndquadrant.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 10/12/2016 08:55 PM, Robert Haas wrote:
> On Wed, Oct 12, 2016 at 3:21 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>> I think at higher client count from client count 96 onwards contention
>> on CLogControlLock is clearly visible and which is completely solved
>> with group lock patch.
>>
>> And at lower client count 32,64 contention on CLogControlLock is not
>> significant hence we can not see any gain with group lock patch.
>> (though we can see some contention on CLogControlLock is reduced at 64
>> clients.)
>
> I agree with these conclusions. I had a chance to talk with Andres
> this morning at Postgres Vision and based on that conversation I'd
> like to suggest a couple of additional tests:
>
> 1. Repeat this test on x86. In particular, I think you should test on
> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> 2. Repeat this test with a mixed read-write workload, like -b
> tpcb-like(at)1 -b select-only(at)9
>

FWIW, I'm already running similar benchmarks on an x86 machine with 72
cores (144 with HT). It's "just" a 4-socket system, but the results I
got so far seem quite interesting. The tooling and results (pushed
incrementally) are available here:

https://bitbucket.org/tvondra/hp05-results/overview

The tooling is completely automated, and it also collects various stats,
like for example the wait event. So perhaps we could simply run it on
ctulhu and get comparable results, and also more thorough data sets than
just snippets posted to the list?

There's also a bunch of reports for the 5 already completed runs

- dilip-300-logged-sync
- dilip-300-unlogged-sync
- pgbench-300-logged-sync-skip
- pgbench-300-unlogged-sync-noskip
- pgbench-300-unlogged-sync-skip

The name identifies the workload type, scale and whether the tables are
wal-logged (for pgbench the "skip" means "-N" while "noskip" does
regular pgbench).

For example the "reports/wait-events-count-patches.txt" compares the
wait even stats with different patches applied (and master):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/wait-events-count-patches.txt?at=master&fileviewer=file-view-default

and average tps (from 3 runs, 5 minutes each):

https://bitbucket.org/tvondra/hp05-results/src/506d0bee9e6557b015a31d72f6c3506e3f198c17/reports/tps-avg-patches.txt?at=master&fileviewer=file-view-default

There are certainly interesting bits. For example while the "logged"
case is dominated y WALWriteLock for most client counts, for large
client counts that's no longer true.

Consider for example dilip-300-logged-sync results with 216 clients:

wait_event | master | gran_lock | no_cont_lock | group_upd
--------------------+---------+-----------+--------------+-----------
CLogControlLock | 624566 | 474261 | 458599 | 225338
WALWriteLock | 431106 | 623142 | 619596 | 699224
| 331542 | 358220 | 371393 | 537076
buffer_content | 261308 | 134764 | 138664 | 102057
ClientRead | 59826 | 100883 | 103609 | 118379
transactionid | 26966 | 23155 | 23815 | 31700
ProcArrayLock | 3967 | 3852 | 4070 | 4576
wal_insert | 3948 | 10430 | 9513 | 12079
clog | 1710 | 4006 | 2443 | 925
XidGenLock | 1689 | 3785 | 4229 | 3539
tuple | 965 | 617 | 655 | 840
lock_manager | 300 | 571 | 619 | 802
WALBufMappingLock | 168 | 140 | 158 | 147
SubtransControlLock | 60 | 115 | 124 | 105

Clearly, CLOG is an issue here, and it's (slightly) improved by all the
patches (group_update performing the best). And with 288 clients (which
is 2x the number of virtual cores in the machine, so not entirely crazy)
you get this:

wait_event | master | gran_lock | no_cont_lock | group_upd
--------------------+---------+-----------+--------------+-----------
CLogControlLock | 901670 | 736822 | 728823 | 398111
buffer_content | 492637 | 318129 | 319251 | 270416
WALWriteLock | 414371 | 593804 | 589809 | 656613
| 380344 | 452936 | 470178 | 745790
ClientRead | 60261 | 111367 | 111391 | 126151
transactionid | 43627 | 34585 | 35464 | 48679
wal_insert | 5423 | 29323 | 25898 | 30191
ProcArrayLock | 4379 | 3918 | 4006 | 4582
clog | 2952 | 9135 | 5304 | 2514
XidGenLock | 2182 | 9488 | 8894 | 8595
tuple | 2176 | 1288 | 1409 | 1821
lock_manager | 323 | 797 | 827 | 1006
WALBufMappingLock | 124 | 124 | 146 | 206
SubtransControlLock | 85 | 146 | 170 | 120

So even buffer_content gets ahead of the WALWriteLock. I wonder whether
this might be because of only having 128 buffers for clog pages, causing
contention on this system (surely, systems with 144 cores were not that
common when the 128 limit was introduced).

So the patch has positive impact even with WAL, as illustrated by tps
improvements (for large client counts):

clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
36 | 39725 | 39627 | 41203 | 39763
72 | 70533 | 65795 | 65602 | 66195
108 | 81664 | 87415 | 86896 | 87199
144 | 68950 | 98054 | 98266 | 102834
180 | 105741 | 109827 | 109201 | 113911
216 | 62789 | 92193 | 90586 | 98995
252 | 94243 | 102368 | 100663 | 107515
288 | 57895 | 83608 | 82556 | 91738

I find the tps fluctuation intriguing, and I'd like to see that fixed
before committing any of the patches.

For pgbench-300-logged-sync-skip (the other WAL-logging test already
completed), the CLOG contention is also reduced significantly, but the
tps did not improve this significantly.

For the the unlogged case (dilip-300-unlogged-sync), the results are
fairly similar - CLogControlLock and buffer_content dominating the wait
even profiles (WALWriteLock is missing, of course), and the average tps
fluctuates in almost exactly the same way.

Interestingly, no fluctuation for the pgbench tests. For example for
pgbench-300-unlogged-sync-ski (i.e. pgbench -N) the result is this:

clients | master | gran_locking | no_content_lock | group_update
---------+--------+--------------+-----------------+--------------
36 | 147265 | 148663 | 148985 | 146559
72 | 162645 | 209070 | 207841 | 204588
108 | 135785 | 219982 | 218111 | 217588
144 | 113979 | 228683 | 228953 | 226934
180 | 96930 | 230161 | 230316 | 227156
216 | 89068 | 224241 | 226524 | 225805
252 | 78203 | 222507 | 225636 | 224810
288 | 63999 | 204524 | 225469 | 220098

That's a fairly significant improvement, and the behavior is very
smooth. Sadly, with WAL logging (pgbench-300-logged-sync-ski) the tps
drops back to master mostly thanks to WALWriteLock.

Another interesting aspect of the patches is impact on variability of
results - for example looking at dilip-300-unlogged-sync, the overall
average tps (for the three runs combined) and for each of the tree runs,
looks like this:

clients | avg_tps | tps_1 | tps_2 | tps_3
---------+---------+-----------+-----------+-----------
36 | 117332 | 115042 | 116125 | 120841
72 | 90917 | 72451 | 119915 | 80319
108 | 96070 | 106105 | 73606 | 108580
144 | 81422 | 71094 | 102109 | 71063
180 | 88537 | 98871 | 67756 | 99021
216 | 75962 | 65584 | 96365 | 66010
252 | 59941 | 57771 | 64756 | 57289
288 | 80851 | 93005 | 56454 | 93313

Notice the variability between the runs - the difference between min and
max is often more than 40%. Now compare it to results with the
"group-update" patch applied:

clients | avg_tps | tps_1 | tps_2 | tps_3
---------+---------+-----------+-----------+-----------
36 | 116273 | 117031 | 116005 | 115786
72 | 145273 | 147166 | 144839 | 143821
108 | 89892 | 89957 | 89585 | 90133
144 | 130176 | 130310 | 130565 | 129655
180 | 81944 | 81927 | 81951 | 81953
216 | 124415 | 124367 | 123228 | 125651
252 | 76723 | 76467 | 77266 | 76436
288 | 120072 | 121205 | 119731 | 119283

In this case there's pretty much no cross-run variability, the
differences are usually within 2%, so basically random noise. (There's
of course the variability depending on client count, but that was
already mentioned).

There's certainly much more interesting stuff in the results, but I
don't have time for more thorough analysis now - I only intended to do
some "quick benchmarking" on the patch, and I've already spent days on
this, and I have other things to do.

I'll take care of collecting data for the remaining cases on this
machine (and possibly running the same tests on the other one, if I
manage to get access to it again). But I'll leave further analysis of
the collected data up to the patch authors, or some volunteers.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Re: Speed up Clog Access by increasing CLOG buffers at 2016-10-12 18:55:26 from Robert Haas

Responses

Re: Speed up Clog Access by increasing CLOG buffers at 2016-10-15 07:43:03 from Amit Kapila

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Eisentraut	2016-10-13 02:40:23	Re: Non-empty default log_line_prefix
Previous Message	Michael Paquier	2016-10-13 02:22:58	Re: Renaming of pg_xlog and pg_clog