Re: Some performance degradation in REL_16 vs REL_15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Anton A(dot) Melnikov" <a(dot)melnikov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, iamqyh(at)gmail(dot)com
Subject: Re: Some performance degradation in REL_16 vs REL_15
Date: 2023-11-15 20:21:33
Message-ID: 20231115202133.4iqp5u6ekmpzgaqr@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-11-15 10:09:06 -0500, Tom Lane wrote:
> "Anton A. Melnikov" <a(dot)melnikov(at)postgrespro(dot)ru> writes:
> > I can't understand why i get the opposite results on my pc and on the server. It is clear that the absolute
> > TPS values will be different for various configurations. This is normal. But differences?
> > Is it unlikely that some kind of reference configuration is needed to accurately
> > measure the difference in performance. Probably something wrong with my pc, but now
> > i can not figure out what's wrong.
>
> > Would be very grateful for any advice or comments to clarify this problem.
>
> Benchmarking is hard :-(.

Indeed.

> IME it's absolutely typical to see variations of a couple of percent even
> when "nothing has changed", for example after modifying some code that's
> nowhere near any hot code path for the test case. I usually attribute this
> to cache effects, such as a couple of bits of hot code now sharing or not
> sharing a cache line.

FWIW, I think we're overusing that explanation in our community. Of course you
can encounter things like this, but the replacement policies of cpu caches
have gotten a lot better and the caches have gotten bigger too.

IME this kind of thing is typically dwarfed by much bigger variations from
things like

- cpu scheduling - whether the relevant pgbench thread is colocated on the
same core as the relevant backend can make a huge difference,
particularly when CPU power saving modes are not disabled. Just looking at
tps from a fully cached readonly pgbench, with a single client:

Power savings enabled, same core:
37493

Power savings enabled, different core:
28539

Power savings disabled, same core:
38167

Power savings disabled, different core:
37365

- can transparent huge pages be used for the executable mapping, or not

On newer kernels, Linux (with some filesystems) can use huge pages for the
executable. To what degree that succeeds is a large factor in performance.

Single threaded read-only pgbench

postgres mapped without huge pages:
37155 TPS

with 2MB of postgres as huge pages:
37695 TPS

with 6MB of postgres as huge pages:
42733 TPS

The really annoying thing about this is that it's entirely unpredictable whether
huge pages are used or not. Building the same way, sometimes 0, sometimes 2MB,
sometimes 6MB are mapped huge, even though the on-disk contents are
precisely the same. And it can even change without rebuilding, if the
binary is evicted from the page cache.

This alone makes benchmarking extremely annoying. It basically can't be
controlled and has huge effects.

- How long ago the server was started

If e.g. in one run your benchmark is the first connection to a database,
and after a restart it isn't (e.g. autovacuum starts up beforehand), you can get
a fairly different memory layout and cache situation, due to [not] using the
relcache init file. Without the init file you end up with a populated
catcache, with it you don't.

Another mean one is whether you start your benchmark within a relatively
short time of the server starting. Readonly pgbench with a single client,
started immediately after the server:

progress: 12.0 s, 37784.4 tps, lat 0.026 ms stddev 0.001, 0 failed
progress: 13.0 s, 37779.6 tps, lat 0.026 ms stddev 0.001, 0 failed
progress: 14.0 s, 37668.2 tps, lat 0.026 ms stddev 0.001, 0 failed
progress: 15.0 s, 32133.0 tps, lat 0.031 ms stddev 0.113, 0 failed
progress: 16.0 s, 37564.9 tps, lat 0.027 ms stddev 0.012, 0 failed
progress: 17.0 s, 37731.7 tps, lat 0.026 ms stddev 0.001, 0 failed

There's a dip at 15s, odd - turns out that's due to bgwriter writing a WAL
record, which triggers walwriter to write it out and then initialize the
whole WAL buffers with 0s - happens once. In this case I've exaggerated the
effect a bit by using a 1GB wal_buffers, but it's visible otherwise too.
Whether your benchmark period includes that dip or not adds a fair bit of
noise.

You can even see the effects of autovacuum workers launching - even if
there's nothing to do! Not as a huge dip, but enough to add some "run to
run" variation.

- How much other dirty data there is in the kernel pagecache. If you e.g. just
built a new binary, even with just minor changes, the kernel will need to
flush those pages eventually, which may contend for IO and increase page
faults.

Rebuilding an optimized build generates something like 1GB of dirty
data. Particularly with ccache, that'll typically not yet be flushed by the
time you run a benchmark. That's not nothing, even with a decent NVMe SSD.

- many more, unfortunately
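
Several of the factors above can't be controlled, but some can at least be
recorded next to each run so that outliers are easier to explain. A minimal
sketch, assuming Linux and Python 3: it reads the "Dirty" line from
/proc/meminfo, the "FilePmdMapped" line from /proc/<pid>/smaps_rollup (on
kernels that report it; it covers all file-backed mappings of the process,
not only the postgres executable), and the CPUs the process may run on. The
names parse_kb_field, read_text and snapshot are made up for this
illustration and are not part of PostgreSQL or pgbench.

```python
import os
import re
from typing import Optional


def parse_kb_field(text: str, field: str) -> Optional[int]:
    """Return the kB value of a 'Field:   123 kB' line, or None if absent."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_text(path: str) -> str:
    """Read a small /proc file as text."""
    with open(path, "r", encoding="ascii", errors="replace") as f:
        return f.read()


def snapshot(postgres_pid: int) -> dict:
    """Collect a few noise-relevant facts about the given process (Linux only)."""
    meminfo = read_text("/proc/meminfo")
    rollup = read_text(f"/proc/{postgres_pid}/smaps_rollup")
    return {
        # Dirty pagecache data the kernel still has to flush.
        "dirty_kb": parse_kb_field(meminfo, "Dirty"),
        # File-backed memory mapped with huge pages; None if not reported.
        "file_pmd_mapped_kb": parse_kb_field(rollup, "FilePmdMapped"),
        # CPUs the process is allowed to be scheduled on.
        "allowed_cpus": sorted(os.sched_getaffinity(postgres_pid)),
    }
```

Calling snapshot() with the pid of the backend serving the pgbench connection,
once before and once after a run, and storing the result with the tps numbers
is one way to use it. Reading another user's smaps_rollup may require matching
privileges.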

Greetings,

Andres Freund
