Re: strange perf regression with data checksums

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange perf regression with data checksums
Date: 2025-05-22 12:56:33
Message-ID: bd8d04ec-11f9-443c-b431-c3f65ab04b96@vondra.me
Lists: pgsql-hackers

Hi,

I finally had time to do more rigorous testing on the v1/v2 patches.
Attached is a .tgz with a test script that initializes pgbench at
scale 1, and then:

* Modifies the data to have different patterns / number of matching
rows, etc. This is done by the scripts in the init/ directory.

* Runs queries that either match or do not match any rows. This is
done by scripts in select/ directory.

* Repeats the runs with 32, 64 and 96 clients (the system has ~96 cores)

The scripts also force a particular scan type (bitmap/index/index-only),
and may also pin the processes to CPUs in different ways:

* default = no pinning, it's up to scheduler
* colocated = pgbench/backend always on the same core
* random = pgbench/backend always on a different random core

This is done by a custom pgbench patch (can share, if needed). I found
the pinning can have a *massive* impact in some cases.
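
To make the three pinning modes above a bit more concrete, here is a
rough Python sketch of how a core could be picked for each
pgbench/backend pair. This is NOT the pgbench patch (which is not shown
here). The function name pick_cores, the use of a random core for the
"colocated" pair, and reading "random" as "two distinct random cores"
are all assumptions made for illustration.

```python
import random


def pick_cores(mode, ncores, rng=random):
    """Pick (pgbench_core, backend_core) for one client connection.

    Hypothetical helper, not taken from the actual pgbench patch.
    Returns None when placement is left to the scheduler.
    """
    if mode == "default":
        # no pinning, the kernel scheduler decides
        return None
    if mode == "colocated":
        # assumption: any core will do, as long as both share it
        core = rng.randrange(ncores)
        return (core, core)
    if mode == "random":
        # assumption: two distinct cores, chosen at random
        client_core, backend_core = rng.sample(range(ncores), 2)
        return (client_core, backend_core)
    raise ValueError(f"unknown pinning mode: {mode}")


# On Linux the chosen core could then be applied with
# os.sched_setaffinity(pid, {core}); that step is omitted here.
```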

There's also a CSV with the raw results, and two PDF files with a
summary of the results:

* results-relative-speedup-vs-master.pdf - Shows throughput relative
to master (for the same client count), 100% means no difference.

* results-relative-speedup-vs-32.pdf - Slightly different view on the
data, showing "scalability" for a given build. It compares
throughput to "expected" multiple of the result we got for 32
clients. 100% means linear scalability.
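
The two percentages described above can be computed roughly like this.
It is only a sketch of the arithmetic as described, assuming throughput
is measured in tps; the function names and the numbers in the usage
lines are made up, not taken from the CSV.

```python
def relative_to_master(tps_build, tps_master):
    """Throughput of a build as a percentage of master, for the same
    client count. 100.0 means no difference."""
    return 100.0 * tps_build / tps_master


def scalability_vs_32(tps, clients, tps_at_32, base_clients=32):
    """Throughput as a percentage of the "expected" multiple of the
    32-client result. 100.0 means linear scalability, more than 100.0
    means superlinear."""
    expected = tps_at_32 * clients / base_clients
    return 100.0 * tps / expected


# Hypothetical numbers, for illustration only.
vs_master = relative_to_master(tps_build=110000.0, tps_master=100000.0)
vs_32 = scalability_vs_32(tps=300000.0, clients=96, tps_at_32=100000.0)
```

With these made-up inputs the build would be at 110% of master, and the
96-client run would sit at exactly 100% (3x the 32-client throughput).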

As usual, green=good, red=bad. My observation is that v2 performs better
than v1 (more green, darker green). v2 helps even in cases where v1 did
not make any difference (e.g. some of the "nomatch" cases).

It's also interesting how much impact the pinning has - the "colocated"
results are much better. It's also interesting that in a couple of cases
we scale superlinearly, i.e. 96 clients get more than 3x the throughput
of 32 clients.

I've seen this before, and I believe it's due to behavior of the
hardware, and some kernel optimizations. Perhaps there's something we
could learn from this, not sure.

Anyway, as a comparison of v1 and v2 I think this is enough.

regards

--
Tomas Vondra

Attachment Content-Type Size
results-relative-speedup-vs-32.pdf application/pdf 64.5 KB
results-relative-speedup-vs-master.pdf application/pdf 62.8 KB
test-scripts.tgz application/x-compressed-tar 33.7 KB
