Re: Possible performance regression in version 10.1 with pgbench read-write tests.

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Date: 2018-07-23 03:40:13
Message-ID: CAEepm=31A_tpsgdHP8evosoesq9qBNt5_dTk4CR+TYs5Wzr4AQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jul 22, 2018 at 8:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2018-07-20 16:43:33 -0400, Tom Lane wrote:
>>> On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
>>> only got 8 cores) but other parameters matching Mithun's example,
>>> I just got
>
>> It's *really* common to have more actual clients than cpus for oltp
>> workloads, so I don't think it's insane to test with more clients.
>
> I finished a set of runs using similar parameters to Mithun's test except
> for using 8 clients, and another set using 72 clients (but, being
> impatient, 5-minute runtime) just to verify that the results wouldn't
> be markedly different. I got TPS numbers like this:
>
>                        8 clients    72 clients
>
> unmodified HEAD          16112        16284
> with padding patch       16096        16283
> with SysV semas          15926        16064
> with padding+SysV        15949        16085
>
> This is on RHEL6 (kernel 2.6.32-754.2.1.el6.x86_64), hardware is dual
> 4-core Intel E5-2609 (Sandy Bridge era). This hardware does show NUMA
> effects, although no doubt less strongly than Mithun's machine.
>
> I would like to see some other results with a newer kernel. I tried to
> repeat this test on a laptop running Fedora 28, but soon concluded that
> anything beyond very short runs was mainly going to tell me about thermal
> throttling :-(. I could possibly get repeatable numbers from, say,
> 1-minute SELECT-only runs, but that would be a different test scenario,
> likely one with a lot less lock contention.

I did some testing on 2-node, 4-node and 8-node systems running Linux
3.10.something (slightly newer but still ancient). Only the 8-node
box (the same one Mithun used) shows the large effect. The 2-node box
may be a tiny bit faster patched, but I'm calling that noise for now;
it's not slower, anyway.

On the problematic box, I also tried some different strides (char
padding[N - sizeof(sem_t)]) and was surprised by the result:

Unpatched         = ~35k TPS
64 byte stride    = ~35k TPS
128 byte stride   = ~42k TPS
4096 byte stride  = ~47k TPS

Huh. PG_CACHE_LINE_SIZE is 128, but the true cache line size on this
system is 64 bytes. That exaggeration turned out to do something
useful, though I can't explain it.
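
For readers who want to picture the stride experiment, here is a minimal
sketch of the layout implied by "char padding[N - sizeof(sem_t)]". The
names SEM_STRIDE, PaddedSem and padded_sem_offset are hypothetical and
are not taken from the PostgreSQL source or the patch under discussion;
the sketch only shows how the padding fixes the distance between
neighbouring semaphores in an array. It assumes SEM_STRIDE is larger
than sizeof(sem_t) and a multiple of its alignment.

```c
#include <assert.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical stride N; the email tried 64, 128 and 4096 bytes. */
#define SEM_STRIDE 128

/*
 * One array slot: the semaphore followed by filler, so that slot i+1
 * starts exactly SEM_STRIDE bytes after slot i.
 */
typedef struct PaddedSem
{
    sem_t sem;
    char  padding[SEM_STRIDE - sizeof(sem_t)];
} PaddedSem;

/* Compile-time check that the padding really yields the chosen stride. */
_Static_assert(sizeof(PaddedSem) == SEM_STRIDE,
               "PaddedSem must occupy exactly one stride");

/* Byte offset of slot i from the start of a PaddedSem array. */
static size_t
padded_sem_offset(size_t i)
{
    return i * sizeof(PaddedSem);
}
```

Changing SEM_STRIDE and rebuilding is enough to reproduce the kind of
sweep reported above, provided the array itself is allocated at an
address aligned to the stride.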

While looking for discussion of 128 byte cache effects I came across
the Intel "L2 adjacent cache line prefetcher"[1]. Maybe this, or some
of the other prefetchers (enabled in the BIOS) or related stuff could
be at work here. It could be microarchitecture-dependent (this is an
old Westmere box), though I found a fairly recent discussion about a
similar effect[2] that mentions more recent hardware. The spatial
prefetcher reference can be found in the Optimization Manual[3].
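
To make the adjacent-line idea concrete: the spatial prefetcher
described in the Optimization Manual tries to complete each fetched
line with its partner in the same 128-byte aligned chunk. The sketch
below only does the address arithmetic for that, under the assumption
that the semaphore array starts on a 128-byte boundary; same_chunk and
neighbours_share_pair are hypothetical helper names. It does not
demonstrate that this prefetcher causes the measured difference, and it
says nothing about why a 4096 byte stride did better still.

```c
#include <stdbool.h>
#include <stddef.h>

#define LINE_SIZE 64    /* true cache line size on the 8-node box */
#define PAIR_SIZE 128   /* aligned chunk the spatial prefetcher completes */

/* True if two byte offsets from an aligned base fall in the same chunk. */
static bool
same_chunk(size_t a, size_t b, size_t chunk)
{
    return a / chunk == b / chunk;
}

/*
 * True if slot 0 and slot 1 of an array with the given stride land in
 * the same 128-byte aligned pair of cache lines.
 */
static bool
neighbours_share_pair(size_t stride)
{
    return same_chunk(0, stride, PAIR_SIZE);
}
```

With a 64 byte stride, slots 0 and 1 sit in different cache lines but in
the same 128-byte pair; with a 128 byte stride they no longer share a
pair.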

[1] https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
[2] https://groups.google.com/forum/#!msg/mechanical-sympathy/i3-M2uCYTJE/P7vyoOTIAgAJ
[3] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

--
Thomas Munro
http://www.enterprisedb.com
