Re: Speed up Clog Access by increasing CLOG buffers

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Speed up Clog Access by increasing CLOG buffers
Date: 2016-09-20 22:18:55
Message-ID: a87bfbfb-6511-b559-bab6-5966b7aabb8e@2ndquadrant.com

Hi,

On 09/19/2016 09:10 PM, Robert Haas wrote:
>
> It's possible that the effect of this patch depends on the number of
> sockets. EDB test machine cthulhu has 8 sockets, and power2 has 4
> sockets. I assume Dilip's tests were run on one of those two,
> although he doesn't seem to have mentioned which one. Your system is
> probably 2 or 4 sockets, which might make a difference. Results
> might also depend on CPU architecture; power2 is, unsurprisingly, a
> POWER system, whereas I assume you are testing x86. Maybe somebody
> who has access should test on hydra.pg.osuosl.org, which is a
> community POWER resource. (Send me a private email if you are a known
> community member who wants access for benchmarking purposes.)
>

Yes, I'm using x86 machines:

1) large but slightly old
- 4 sockets, e5-4620 (a somewhat older CPU, 32 cores in total)
- kernel 3.2.80

2) smaller but fresh
- 2 sockets, e5-2620 v4 (the newest generation of Xeons, 16 cores in total)
- kernel 4.8.0

> Personally, I find the results so far posted on this thread
> thoroughly unimpressive. I acknowledge that Dilip's results appear
> to show that in a best-case scenario these patches produce a rather
> large gain. However, that gain seems to happen in a completely
> contrived scenario: astronomical client counts, unlogged tables, and
> a test script that maximizes pressure on CLogControlLock. If you
> have to work that hard to find a big win, and tests under more
> reasonable conditions show no benefit, it's not clear to me that it's
> really worth the time we're all spending benchmarking and reviewing
> this, or the risk of bugs, or the damage to the SLRU abstraction
> layer. I think there's a very good chance that we're better off
> moving on to projects that have a better chance of helping in the
> real world.

I'm posting results from two types of workloads - traditional r/w
pgbench and Dilip's transaction, each with synchronous_commit on and off.

Full results (including script driving the benchmark) are available
here, if needed:

https://bitbucket.org/tvondra/group-clog-benchmark/src

It'd be good if someone could try to reproduce this on a comparable
machine, to rule out my stupidity.
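
To make this a bit more concrete, here's a minimal sketch of what the
driver loop does (the real script is in the repository above; the client
counts, run length, database name and the "dilip.sql" file name below
are illustrative, not necessarily the exact values used):

  #!/usr/bin/env bash
  # sketch only - see the bitbucket repository for the actual driver

  RUNS=5               # 5 runs on the 2-socket box, 10 on the 4-socket one
  DURATION=300         # assumed per-run duration in seconds
  CLIENTS="1 4 8 16 32 64"

  for sync in off on; do
      # synchronous_commit is user-settable, so a reload is enough
      psql -c "ALTER SYSTEM SET synchronous_commit = '$sync'" postgres
      pg_ctl -D "$PGDATA" reload

      for c in $CLIENTS; do
          for run in $(seq "$RUNS"); do
              # traditional read/write pgbench
              pgbench -n -M prepared -c "$c" -j "$c" -T "$DURATION" test \
                  >> "pgbench-$sync-$c.log"
              # Dilip's transaction as a custom pgbench script
              pgbench -n -M prepared -f dilip.sql -c "$c" -j "$c" \
                  -T "$DURATION" test >> "dilip-$sync-$c.log"
          done
      done
  done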

2 x e5-2620 v4 (16 cores, 32 with HT)
=====================================

On the "smaller" machine the results look like this - I have only tested
up to 64 clients, as higher values seem rather uninteresting on a
machine with only 16 physical cores.

These are averages of 5 runs, where the min/max for each group are
within ~5% in most cases (see the "spread" sheet). The "e5-2620" sheet
also shows the numbers as % compared to master.

dilip / sync=off           1       4       8      16      32      64
--------------------------------------------------------------------
master                  4756   17672   35542   57303   74596   82138
granular-locking        4745   17728   35078   56105   72983   77858
no-content-lock         4646   17650   34887   55794   73273   79000
group-update            4582   17757   35383   56974   74387   81794

dilip / sync=on            1       4       8      16      32      64
--------------------------------------------------------------------
master                  4819   17583   35636   57437   74620   82036
granular-locking        4568   17816   35122   56168   73192   78462
no-content-lock         4540   17662   34747   55560   73508   79320
group-update            4495   17612   35474   57095   74409   81874

pgbench / sync=off         1       4       8      16      32      64
--------------------------------------------------------------------
master                  3791   14368   27806   43369   54472   62956
granular-locking        3822   14462   27597   43173   56391   64669
no-content-lock         3725   14212   27471   43041   55431   63589
group-update            3895   14453   27574   43405   56783   62406

pgbench / sync=on          1       4       8      16      32      64
--------------------------------------------------------------------
master                  3907   14289   27802   43717   56902   62916
granular-locking        3770   14503   27636   44107   55205   63903
no-content-lock         3772   14111   27388   43054   56424   64386
group-update            3844   14334   27452   43621   55896   62498

There's pretty much no improvement at all - most of the results are
within 1-2% of master, in both directions. Hardly a win.

Actually, with 1 client there seems to be a ~5% regression, but it might
also be noise; verifying it would require further testing.

4 x e5-4620 v1 (32 cores, 64 with HT)
=====================================

These are averages of 10 runs, and there are a few strange things here.

Firstly, for Dilip's workload the results get much (much) worse between
64 and 128 clients, for some reason. I suspect this might be due to the
fairly old kernel (3.2.80), so I'll reboot the machine with a 4.5.x
kernel and try again.

Secondly, the min/max differences get much larger than the ~5% on the
smaller machine - with 128 clients, the (max-min)/average is often
>100%. See the "spread" or "spread2" sheets in the attached file.
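
Just to be explicit about that metric, this is the arithmetic (the
per-run tps values below are made up, purely for illustration):

  tps="14441 14395 14512 35020 14288 14301 14350 14290 14410 14378"

  awk -v vals="$tps" 'BEGIN {
      n = split(vals, a, " ")
      min = max = a[1]; sum = 0
      for (i = 1; i <= n; i++) {
          sum += a[i]
          if (a[i] < min) min = a[i]
          if (a[i] > max) max = a[i]
      }
      avg = sum / n
      printf "min=%d max=%d avg=%.0f spread=%.0f%%\n",
             min, max, avg, 100 * (max - min) / avg
  }'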

But for some reason this only affects Dilip's workload, and apparently
the patches make it measurably worse (master is ~75%, patches ~120%). If
you look at tps for individual runs, there are usually 9 runs with
almost the same performance, and then one or two much faster runs.
Again, pgbench does not seem to have this issue.

I have no idea what's causing this - it might be related to the kernel,
but I'm not sure why it should affect the patches differently. Let's see
how the new kernel affects this.

dilip / sync=off          16      32      64     128     192
------------------------------------------------------------
master                 26198   37901   37211   14441    8315
granular-locking       25829   38395   40626   14299    8160
no-content-lock        25872   38994   41053   14058    8169
group-update           26503   38911   42993   19474    8325

dilip / sync=on           16      32      64     128     192
------------------------------------------------------------
master                 26138   37790   38492   13653    8337
granular-locking       25661   38586   40692   14535    8311
no-content-lock        25653   39059   41169   14370    8373
group-update           26472   39170   42126   18923    8366

pgbench / sync=off        16      32      64     128     192
------------------------------------------------------------
master                 23001   35762   41202   31789    8005
granular-locking       23218   36130   42535   45850    8701
no-content-lock        23322   36553   42772   47394    8204
group-update           23129   36177   41788   46419    8163

pgbench / sync=on         16      32      64     128     192
------------------------------------------------------------
master                 22904   36077   41295   35574    8297
granular-locking       23323   36254   42446   43909    8959
no-content-lock        23304   36670   42606   48440    8813
group-update           23127   36696   41859   46693    8345

So there is some improvement due to the patches for 128 clients (+30% in
some cases), but it's of little practical use, as 64 clients give you
either comparable performance (pgbench workload) or much better
performance (Dilip's workload).

Also, there's pretty much no difference between synchronous_commit on
and off, probably thanks to running on unlogged tables.
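
For anyone trying to reproduce this: pgbench can create its tables as
unlogged via the --unlogged-tables switch at initialization time (the
scale factor and database name here are just illustrative):

  pgbench -i -s 300 --unlogged-tables test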

I'll repeat the test on the 4-socket machine with a newer kernel, but
that's probably the last benchmark I'll do for this patch for now. I
agree with Robert that the cases the patch is supposed to improve are a
bit contrived because of the very high client counts.

IMHO to continue with the patch (or even with testing it), we really
need a credible / practical example of a real-world workload that
benefits from the patches. The closest we have to that is Amit's
suggestion that someone hit the commit lock when running HammerDB, but
we have absolutely no idea what parameters they were using, except that
they were running with synchronous_commit=off. Pgbench shows no such
improvement (at least for me), at least with reasonable parameters.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
results.ods application/vnd.oasis.opendocument.spreadsheet 93.2 KB
