From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Possible performance regression in version 10.1 with pgbench read-write tests.
Date: 2018-07-20 20:43:33
Message-ID: 13622.1532119413@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
>> In any case, I strongly resist making performance-based changes on
>> the basis of one test on one kernel and one hardware platform.

> Sure, it'd be good to do more of that. But from a theoretical POV it's
> quite logical that POSIX semas sharing cachelines is bad for
> performance, if there's any contention. When backed by futexes -
> i.e. all non-ancient Linux machines - the hot path just does a cmpxchg
> of the *userspace* data (I've copied the relevant code below).
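
(Since Andres's code excerpt isn't reproduced here, the following is a
minimal illustrative sketch of what a futex-backed sem_wait fast path
looks like -- invented names, not the glibc code he quoted:)

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical futex-backed semaphore; the counter lives in shared,
 * user-space memory. */
typedef struct
{
    _Atomic unsigned value;
} sketch_sem_t;

static void
sketch_sem_wait(sketch_sem_t *s)
{
    unsigned    v = atomic_load(&s->value);

    for (;;)
    {
        /* Fast path: claim a unit with a single userspace cmpxchg;
         * no kernel entry at all in the uncontended case. */
        while (v > 0)
        {
            if (atomic_compare_exchange_weak(&s->value, &v, v - 1))
                return;
            /* cmpxchg failure reloaded v; just retry */
        }

        /* Slow path: sleep in the kernel until somebody posts.  If the
         * counter is already nonzero when the kernel rechecks it,
         * FUTEX_WAIT returns at once and we retry the fast path. */
        syscall(SYS_futex, &s->value, FUTEX_WAIT, 0, NULL, NULL, 0);
        v = atomic_load(&s->value);
    }
}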

Here's the thing: the hot path is of little or no interest, because
if we are in the sema code at all, we are expecting to block. The
only case where we wouldn't block is if the lock manager decided the
current process needs to sleep, but some other process already released
us by the time we reach the futex/kernel call. Certainly that will happen
some of the time, but it's not likely to be the way to bet. So I'm very
dubious of any arguments based on the speed of the "uncontended" path.
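
For what it's worth, that no-block case is easy to demonstrate in
isolation. Here is a self-contained sketch (plain POSIX semaphores,
not PostgreSQL code) of a waiter that reaches sem_wait() only after
the release has already happened:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t sem;

static void *
releaser(void *arg)
{
    (void) arg;
    sem_post(&sem);             /* "release" before the waiter sleeps */
    return NULL;
}

int
main(void)
{
    pthread_t   t;

    sem_init(&sem, 0, 0);
    pthread_create(&t, NULL, releaser, NULL);
    pthread_join(&t, NULL);     /* force the post to happen first */

    /*
     * We arrive here "expecting to block", but the release already
     * happened, so this returns via the uncontended fast path without
     * ever sleeping in the kernel -- the uncommon case described above.
     */
    sem_wait(&sem);
    puts("released before we ever slept");
    return 0;
}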

It's possible that the bigger picture here is that the kernel boys
optimized for the "uncontended" path to the point where they broke
performance of the blocking path. It's hard to see how they could
have broken it to the point of being slower than the SysV sema API,
though.

Anyway, I think we need to test first and patch second. I'm working
on getting some numbers on my own machines now.

On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
only got 8 cores) but other parameters matching Mithun's example,
I just got

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 29001016
latency average = 0.497 ms
tps = 16111.575661 (including connections establishing)
tps = 16111.623329 (excluding connections establishing)

which is interesting because vmstat was pretty consistently reporting
around 500000 context swaps/second during the run, or circa 30
cs/transaction (500000 cs/s over ~16100 tps works out to about 31).
We'd have a minimum of 14 cs/transaction just between client and
server (seven SQL commands per transaction in TPC-B, each needing one
switch into the server and one back), so the observed total seems on
the low side; not a lot of lock contention here, it seems. I wonder
what the corresponding ratio was in Mithun's runs.
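
For anyone wanting to reproduce the run above, the parameters in the
pgbench output presumably correspond to an invocation along these
lines (the database name "bench" is arbitrary, and the flags are my
reconstruction from the output, not quoted from Mithun's mail):

pgbench -i -s 300 bench
pgbench -M prepared -c 8 -j 8 -T 1800 bench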

regards, tom lane
