Re: Speed up Clog Access by increasing CLOG buffers

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-10-20 15:33:21
Message-ID: acf46406-cf6d-41fe-4118-4e1c960b4790@2ndquadrant.com
Lists: pgsql-hackers

On 10/20/2016 09:36 AM, Dilip Kumar wrote:
> On Thu, Oct 13, 2016 at 12:25 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I agree with these conclusions. I had a chance to talk with Andres
>> this morning at Postgres Vision and based on that conversation I'd
>> like to suggest a couple of additional tests:
>>
>> 1. Repeat this test on x86. In particular, I think you should test on
>> the EnterpriseDB server cthulhu, which is an 8-socket x86 server.
>
> I have done my tests on cthulhu. The basic difference is that on POWER
> we saw ClogControlLock at the top at 96 and more clients with scale
> factor 300, but on cthulhu at scale factor 300 the transactionid lock
> is always at the top. So I repeated my tests with scale factor 1000 on
> cthulhu as well.
>
> All configuration is the same as in my last test.
>
> Test with 1000 scale factor
> -------------------------------------
>
> Test1: number of clients: 192
>
> Head:
> tps = 21206.108856 (including connections establishing)
> tps = 21206.245441 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_192_ul.txt
> 310489 LWLockNamed | CLogControlLock
> 296152 |
> 35537 Lock | transactionid
> 15821 LWLockTranche | buffer_mapping
> 10342 LWLockTranche | buffer_content
> 8427 LWLockTranche | clog
> 3961
> 3165 Lock | extend
> 2861 Lock | tuple
> 2781 LWLockNamed | ProcArrayLock
> 1104 LWLockNamed | XidGenLock
> 745 LWLockTranche | lock_manager
> 371 LWLockNamed | CheckpointerCommLock
> 70 LWLockTranche | wal_insert
> 5 BufferPin | BufferPin
> 3 LWLockTranche | proc
>
> Patch:
> tps = 28725.038933 (including connections establishing)
> tps = 28725.367102 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_192_ul.txt
> 540061 |
> 57810 LWLockNamed | CLogControlLock
> 36264 LWLockTranche | buffer_mapping
> 29976 Lock | transactionid
> 4770 Lock | extend
> 4735 LWLockTranche | clog
> 4479 LWLockNamed | ProcArrayLock
> 4006
> 3955 LWLockTranche | buffer_content
> 2505 LWLockTranche | lock_manager
> 2179 Lock | tuple
> 1977 LWLockNamed | XidGenLock
> 905 LWLockNamed | CheckpointerCommLock
> 222 LWLockTranche | wal_insert
> 8 LWLockTranche | proc
>
> Test2: number of clients: 96
>
> Head:
> tps = 25447.861572 (including connections establishing)
> tps = 25448.012739 (excluding connections establishing)
> 261611 |
> 69604 LWLockNamed | CLogControlLock
> 6119 Lock | transactionid
> 4008
> 2874 LWLockTranche | buffer_mapping
> 2578 LWLockTranche | buffer_content
> 2355 LWLockNamed | ProcArrayLock
> 1245 Lock | extend
> 1168 LWLockTranche | clog
> 232 Lock | tuple
> 217 LWLockNamed | CheckpointerCommLock
> 160 LWLockNamed | XidGenLock
> 158 LWLockTranche | lock_manager
> 78 LWLockTranche | wal_insert
> 5 BufferPin | BufferPin
>
> Patch:
> tps = 32708.368938 (including connections establishing)
> tps = 32708.765989 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_96_ul.txt
> 326601 |
> 7471 LWLockNamed | CLogControlLock
> 5387 Lock | transactionid
> 4018
> 3331 LWLockTranche | buffer_mapping
> 3144 LWLockNamed | ProcArrayLock
> 1372 Lock | extend
> 722 LWLockTranche | buffer_content
> 393 LWLockNamed | XidGenLock
> 237 LWLockTranche | lock_manager
> 234 Lock | tuple
> 194 LWLockTranche | clog
> 96 Lock | relation
> 88 LWLockTranche | wal_insert
> 34 LWLockNamed | CheckpointerCommLock
>
> Test3: number of clients: 64
>
> Head:
>
> tps = 28264.194438 (including connections establishing)
> tps = 28264.336270 (excluding connections establishing)
>
> 218264 |
> 10314 LWLockNamed | CLogControlLock
> 4019
> 2067 Lock | transactionid
> 1950 LWLockTranche | buffer_mapping
> 1879 LWLockNamed | ProcArrayLock
> 592 Lock | extend
> 565 LWLockTranche | buffer_content
> 222 LWLockNamed | XidGenLock
> 143 LWLockTranche | clog
> 131 LWLockNamed | CheckpointerCommLock
> 63 LWLockTranche | lock_manager
> 52 Lock | tuple
> 35 LWLockTranche | wal_insert
>
> Patch:
> tps = 27906.376194 (including connections establishing)
> tps = 27906.531392 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_64_ul.txt
> 228108 |
> 4039
> 2294 Lock | transactionid
> 2116 LWLockTranche | buffer_mapping
> 1757 LWLockNamed | ProcArrayLock
> 1553 LWLockNamed | CLogControlLock
> 800 Lock | extend
> 403 LWLockTranche | buffer_content
> 92 LWLockNamed | XidGenLock
> 74 LWLockTranche | lock_manager
> 42 Lock | tuple
> 35 LWLockTranche | wal_insert
> 34 LWLockTranche | clog
> 14 LWLockNamed | CheckpointerCommLock
>
> Test4: number of clients: 32
>
> Head:
> tps = 27587.999912 (including connections establishing)
> tps = 27588.119611 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_32_ul.txt
> 117762 |
> 4031
> 614 LWLockNamed | ProcArrayLock
> 379 LWLockNamed | CLogControlLock
> 344 Lock | transactionid
> 183 Lock | extend
> 102 LWLockTranche | buffer_mapping
> 71 LWLockTranche | buffer_content
> 39 LWLockNamed | XidGenLock
> 25 LWLockTranche | lock_manager
> 3 LWLockTranche | wal_insert
> 3 LWLockTranche | clog
> 2 LWLockNamed | CheckpointerCommLock
> 2 Lock | tuple
>
> Patch:
> tps = 28291.428848 (including connections establishing)
> tps = 28291.586435 (excluding connections establishing)
> [dilip(dot)kumar(at)cthulhu bin]$ cat 1000_32_ul.txt
> 116596 |
> 4041
> 757 LWLockNamed | ProcArrayLock
> 407 LWLockNamed | CLogControlLock
> 358 Lock | transactionid
> 183 Lock | extend
> 142 LWLockTranche | buffer_mapping
> 77 LWLockTranche | buffer_content
> 68 LWLockNamed | XidGenLock
> 35 LWLockTranche | lock_manager
> 15 LWLockTranche | wal_insert
> 7 LWLockTranche | clog
> 7 Lock | tuple
> 4 LWLockNamed | CheckpointerCommLock
> 1 Tuples only is on.
>
> Summary:
> - At 96 clients and above we can see ClogControlLock at the top.
> - With the patch, contention on ClogControlLock is reduced significantly.
> I think this behaviour is the same as what we saw on POWER.
>
> With 300 scale factor:
> - Contention on ClogControlLock is significant only at 192 clients
> (the transactionid lock is still on top), and it is completely removed
> with the group lock patch.
>
> For 300 scale factor, I am posting data only for the 192 client count
> (if anyone is interested in other data, I can post it).
>

In the results you posted on 10/12, you mentioned a regression
with 32 clients, where you got 52k tps on master but only 48k tps with
the patch (so ~10% difference). I have no idea what scale was used for
those tests, and I see no such regression in the current results (but
you only report results for some of the client counts).

Also, which of the proposed patches have you been testing?

Can you collect and share a more complete set of data, perhaps based on
the scripts I use to do tests on the large machine with 36/72 cores,
available at https://bitbucket.org/tvondra/hp05-results ?
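
In case it helps, the kind of data I'm after boils down to something
like this (a simplified sketch, not the actual scripts from the
repository; the client counts, duration and the 'test' database name
are just examples):

  #!/bin/bash
  # sketch of one benchmark run: for each client count, run pgbench for
  # a fixed time and sample wait events from pg_stat_activity once per
  # second (assumes the 'test' database was initialized beforehand)

  DURATION=300

  for c in 36 72 108 144 180; do
      # sample the wait events in the background
      (
          for s in $(seq 1 $DURATION); do
              psql -A -t -c "SELECT wait_event_type, wait_event
                               FROM pg_stat_activity
                              WHERE state = 'active'
                                AND pid <> pg_backend_pid()" test \
                  >> wait-events-$c.txt
              sleep 1
          done
      ) &
      sampler=$!

      pgbench -N -M prepared -c $c -j $c -T $DURATION -P 1 test \
          > pgbench-$c.log 2>&1

      wait $sampler

      # aggregate the samples, similar to the lists quoted above
      sort wait-events-$c.txt | uniq -c | sort -rn > wait-events-$c.summary
  done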

I've taken some time to build simple web-based reports from the
results collected so far (also included in the git repository), and
pushed them here:

http://tvondra.bitbucket.org

For each of the completed runs, there's a report comparing tps for
different client counts with master and the three patches (average tps,
median and stddev), and it's possible to download a more thorough text
report with wait event stats, comparison of individual runs etc.

If you want to cooperate on this, I'm available - i.e. I can help you
get the tooling running, customize it etc.

Regarding the results collected on the "big machine" so far, I do have a
few observations:

pgbench / scale 300 (fits into 16GB shared buffers)
---------------------------------------------------
* in general, those results seem fine

* the results generally fall into 3 categories (I'll show results for
"pgbench -N" but regular pgbench behaves similarly):

(a) logged, sync_commit=on - no impact
http://tvondra.bitbucket.org/#pgbench-300-logged-sync-skip

(b) logged, sync_commit=off - improvement
http://tvondra.bitbucket.org/#pgbench-300-logged-async-skip

The throughput improves by ~20% with 72 clients, and then it
levels off (but does not drop, unlike on master). With high client
counts the difference is up to 300%, but people who care about
throughput won't run with such client counts anyway.

Not only does this improve throughput, it also significantly
reduces the variability of the performance (i.e. measure throughput
each second and compute the STDDEV of that - a simple way to do
this is sketched after this list). You can imagine this as a much
"smoother" chart of tps over time.

(c) unlogged, sync_commit=* - improvement
http://tvondra.bitbucket.org/#pgbench-300-unlogged-sync-skip

This is actually quite similar to (b).
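
To be concrete about what I mean by variability, here is a minimal
sketch of one way to compute it - it extracts the per-second tps
reported by "pgbench -P 1" and computes the mean and standard
deviation (the log file name is just an example):

  grep '^progress:' pgbench-72.log | awk '
      { tps = $4; sum += tps; sumsq += tps * tps; n++ }
      END {
          mean = sum / n
          printf "samples=%d  mean tps=%.1f  stddev=%.1f\n",
                 n, mean, sqrt(sumsq / n - mean * mean)
      }'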

dilip / scale 300 (fits into 16GB shared buffers)
-------------------------------------------------

* those results seem less OK

* I haven't found any significant regressions (in the sense of a
significant performance drop compared to master), but the behavior in
some cases seems fairly strange (and it's repeatable)

* consider for example these results:

http://tvondra.bitbucket.org/#dilip-300-unlogged-async
http://tvondra.bitbucket.org/#dilip-300-logged-async

* the saw-like pattern is rather suspicious, and I don't think I've seen
anything like that before - I guess there's some feedback loop, and we'd
better find it before committing any of the patches, because this is
something I don't want to see on any production machine (and I bet
neither do you)

* After looking into the wait event details in the full text report at

http://tvondra.bitbucket.org/by-test/dilip-300-unlogged-async.txt

(section "wait events for dilip-300-unlogged-async (runs combined)")

I see that for pg-9.6-group-update the wait event summary for 72, 108
and 144 clients (low - high - low) looks like this:

 clients | wait_event_type |   wait_event    | wait_count | wait_pct
---------+-----------------+-----------------+------------+----------
      72 |                 |                 |     374845 |    62.87
      72 | Client          | ClientRead      |     136320 |    22.86
      72 | LWLockNamed     | CLogControlLock |      52804 |     8.86
      72 | LWLockTranche   | buffer_content  |      15337 |     2.57
      72 | LWLockNamed     | XidGenLock      |       7352 |     1.23
      72 | LWLockNamed     | ProcArrayLock   |       6630 |     1.11

     108 |                 |                 |     407179 |    46.01
     108 | LWLockNamed     | CLogControlLock |     300452 |    33.95
     108 | LWLockTranche   | buffer_content  |      87597 |     9.90
     108 | Client          | ClientRead      |      80901 |     9.14
     108 | LWLockNamed     | ProcArrayLock   |       3290 |     0.37

     144 |                 |                 |     623057 |    53.44
     144 | LWLockNamed     | CLogControlLock |     175072 |    15.02
     144 | Client          | ClientRead      |     163451 |    14.02
     144 | LWLockTranche   | buffer_content  |     147963 |    12.69
     144 | LWLockNamed     | XidGenLock      |      38361 |     3.29
     144 | Lock            | transactionid   |       8821 |     0.76

That is, there's a sudden jump in CLogControlLock from ~9% to ~34% and
then back to ~15% (and for 180 clients it jumps back to ~35%). That's
pretty strange, and all the patches behave exactly the same.
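
For reference, the summary above is just a simple aggregation of the
sampled wait events per client count - assuming the samples were loaded
into a table (the table name below is hypothetical), it boils down to
something like this:

  psql -d test -c "
      SELECT clients, wait_event_type, wait_event,
             count(*) AS wait_count,
             round(100.0 * count(*) /
                   sum(count(*)) OVER (PARTITION BY clients), 2) AS wait_pct
        FROM wait_event_samples
       GROUP BY clients, wait_event_type, wait_event
       ORDER BY clients, wait_count DESC"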

scale 3000 (45GB), shared_buffers=16GB
---------------------------------------

For the small scale, the whole data set fits into 16GB shared buffers,
so there were pretty much no writes except for WAL and CLOG. For scale
3000 that's no longer true - the backends will compete for buffers and
constantly write dirty buffers out to the page cache.

I hadn't realized this initially, so the kernel was using the default
vm.dirty_* limits, i.e. 10% and 20%. As the machine has 3TB of RAM, this
resulted in rather excessive thresholds (or "insane" if you want), so the
kernel regularly accumulated up to ~15GB of dirty data and then wrote it
out in a very short period of time. Even though the machine has fairly
powerful storage (4GB write cache on the controller, 10 x 12Gbps SAS SSDs),
this led to pretty bad latency spikes / drops in throughput.
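
The eventual fix, described a bit further down, is simply switching from
the ratio-based limits to byte-based ones, so the amount of dirty data no
longer scales with the 3TB of RAM. Roughly like this (the dirty_bytes
value is just an illustration, the 256MB background limit is the one I
actually used):

  # switch to byte-based dirty limits (setting the *_bytes variants
  # automatically disables the corresponding *_ratio ones)
  sysctl -w vm.dirty_background_bytes=268435456   # 256MB - start write-out early
  sysctl -w vm.dirty_bytes=1073741824             # 1GB hard limit (illustrative)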

I only did two runs with this (untuned) configuration before realizing
what was happening; the results are illustrated here:

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-high-dirty-bytes
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-high-dirty-bytes

I'm not sure how important those results are (if throughput and smooth
behavior matter, tuning the kernel thresholds is a must), but what I
find interesting is that while the patches manage to improve throughput
by 10-20%, they also (quite significantly) increase the variability of
the results (jitter in the tps over time). It's particularly visible in
the pgbench results. I'm not sure that's a good tradeoff.

After fixing the kernel page cache thresholds (by setting
vm.dirty_background_bytes to 256MB to get a smooth write-out), the
effect differs depending on the workload:

(a) dilip
http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync

- eliminates any impact of all the patches

(b) pgbench (-N)
http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip

- By far the most severe regression observed during the testing.
With 36 clients the throughput drops by ~40%, which I think is
pretty bad. Also the results are much more variable with the
patches (compared to master).

scale 3000 (45GB), shared_buffers=64GB
---------------------------------------

I've also done some tests with increased shared buffers, so that even
the large data set fits into them. Again, the results slightly depend on
the workload:

(a) dilip

* http://tvondra.bitbucket.org/#dilip-3000-unlogged-sync-64
* http://tvondra.bitbucket.org/#dilip-3000-unlogged-async-64

Pretty much no impact on throughput or variability. Unlike on the
small data set, the patches don't even eliminate the performance
drop above 72 clients - the performance closely matches master.

(b) pgbench

* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-skip-64
* http://tvondra.bitbucket.org/#pgbench-3000-unlogged-sync-noskip-64

There's a small benefit (~20% at the same client count), and the
performance drop only happens above 72 clients. The patches also
significantly increase the variability of the results, particularly
for large client counts.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
