scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: scalability bottlenecks with (many) partitions (and more)
Date: 2024-01-28 21:57:02
Message-ID: 510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com

Hi,

I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I came
up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.

Consider a partitioned pgbench database with 100 partitions, say:

pgbench -i -s 100 --partitions 100 testdb

but let's modify pgbench_accounts a little bit:

ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
UPDATE pgbench_accounts SET aid_parent = aid;
CREATE INDEX ON pgbench_accounts(aid_parent);
VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And
now let's run a query

SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
ON (pa.bid = pb.bid) WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S" lookup, except that it's on a
column that does not allow partition elimination. Now, the plan looks
like this:

                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=1.52..34.41 rows=10 width=465)
   Hash Cond: (pa.bid = pb.bid)
   ->  Append  (cost=0.29..33.15 rows=10 width=101)
         ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
             pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
             pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
             pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
             pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  ...
   ->  Hash  (cost=1.10..1.10 rows=10 width=364)
         ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10
             width=364)

So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries. Anyway, let's see how this works on a
big AMD EPYC machine with 96/192 cores - with "-M simple" we get:

parts        1        8       16       32       64       96      160      224
------------------------------------------------------------------------------
    0    13877   105732   210890   410452   709509   844683  1050658  1163026
  100      653     3957     7120    12022    12707    11813    10349     9633
 1000       20      142      270      474      757      808      567      427

These are transactions per second, for different numbers of clients (the
numbers in the header row). With -M prepared the story doesn't change - the
numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput initially grows by
~13k tps per client, then it gradually levels off. But it keeps growing all
the time.

But with 100 or 1000 partitions, the throughput peaks and then starts
dropping. Moreover, with 100 or 1000 partitions the throughput is just a
tiny fraction of the non-partitioned value. The ratio is roughly equal to
the number of partitions - for example with 96 clients, the ratio between
0 and 1000 partitions is 844683/808 = 1045.

I could demonstrate the same behavior with fewer partitions - e.g. with
10 partitions you get a ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with. With "-M simple" you may get this, for example:

 dims        1        8       16       32       64       96      160      224
------------------------------------------------------------------------------
    1    11737    92925   183678   361497   636598   768956   958679  1042799
   10      462     3558     7086    13889    25367    29503    25353    24030
  100        4       31       61      122      231      292      292      288

So, a similar story - a significant slowdown as we add dimensions.

Now, what could be causing this? Clearly, there's a bottleneck of some
kind, and we're hitting it. Some of this may simply be due to the executor
doing more work (more index scans, more initialization, ...), but maybe
not - one of the reasons I started looking into this was that we weren't
using all the CPU even at small scales - the CPU was maybe 60% utilized.

So I started poking at things. The first thing that I thought about was
locking, obviously. That's consistent with the limited CPU utilization
(waiting on a lock = not running), and it's somewhat expected when using
many partitions - we need to lock all of them, and if we have 100 or
1000 of them, that's potentially a lot of locks.

From past experiments I know of two places where such a bottleneck could
be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to give it a
try, increase these values and see what happens.

For NUM_LOCK_PARTITIONS this is pretty simple (see the 0001 patch). The
shared lock table is split into 16 partitions by default, each protected
by an LWLock - it's quite possible that on a machine with many cores
and/or many partitions we can easily hit this. So I bumped this 4x, to 64
partitions.
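
For illustration, the relevant constants live in src/include/storage/lwlock.h,
and a change along these lines takes us from 16 to 64 partitions (the actual
0001 patch may differ in the details):

/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS  6                   /* was 4, i.e. 16 */
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)    /* now 64 */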

For fast-path locking the changes are more complicated (see 0002). We
allow keeping 16 relation locks right in PGPROC, and only when these slots
get full do we fall back to the regular lock table. But with enough
partitions we're guaranteed to fill those 16 slots, of course. And
increasing the number of slots is not simple - firstly, the information
is split between an array of 16 OIDs and a uint64 serving as a bitmap.
Increasing the size of the OID array is easy, but it's harder for the
auxiliary bitmap. And there are more problems - with more OIDs a simple
linear search won't do, but a simple hash table is not a good idea either,
because of poor locality and the need to delete stuff ...

What I ended up doing is a hash table of 16-element arrays. There are 64
"pieces", each essentially the same (16 x OID + uint64 bitmap) we have
now. Each OID is mapped to exactly one of these pieces as if in a hash
table, and within each 16-element piece we do exactly the same thing we
do now (linear search, removal, etc.). This works great, the locality is
great, etc. The one disadvantage is that this makes PGPROC larger, but I
did a lot of benchmarks and I haven't seen any regression that I could
attribute to this. (More about this later.)
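
A minimal sketch of the idea (names and the exact mapping are made up for
illustration - the 0002 patch may do this differently):

#define FP_LOCK_GROUPS           64   /* number of 16-slot "pieces" */
#define FP_LOCK_SLOTS_PER_GROUP  16   /* same size as today's fast-path array */

typedef struct FastPathGroup
{
    uint64      fpLockBits;                         /* lock modes, as today */
    Oid         fpRelId[FP_LOCK_SLOTS_PER_GROUP];   /* relation OIDs */
} FastPathGroup;

/* in PGPROC: FastPathGroup fpGroups[FP_LOCK_GROUPS]; instead of one group */

/* each OID maps to exactly one group; within that group we keep the current
 * linear search / removal logic */
#define FP_LOCK_GROUP(relid)     ((relid) % FP_LOCK_GROUPS)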

Unfortunately, for the pgbench join this does not make much difference.
But for the "star join" (with -M prepared) it does this:

                1        8       16       32       64       96      160      224
---------------------------------------------------------------------------------
master      21610   137450   247541   300902   270932   229692   191454   189233
patched     21664   151695   301451   594615  1036424  1211716  1480953  1656203
speedup       1.0      1.1      1.2      2.0      3.8      5.3      7.7      8.8

That's a pretty nice speedup, I think.

However, why doesn't the partitioned join improve (or at least not by very
much)? Well, the perf profile shows stuff like this:

  9.16%  0.77%  postgres  [kernel.kallsyms]  [k] asm_exc_page_fault
     |
     --8.39%--asm_exc_page_fault
          |
          --7.52%--exc_page_fault
               |
               --7.13%--do_user_addr_fault
                    |
                    --6.64%--handle_mm_fault
                         |
                         --6.29%--__handle_mm_fault
                              |
                              |--2.17%--__mem_cgroup_charge
                              |    |
                              |    |--1.25%--charge_memcg
                              |    |    |
                              |    |     --0.57%-- ...
                              |    |
                              |     --0.67%-- ...
                              |
                              |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be
some sort of scalability problem in glibc's malloc. I decided to check
whether the "memory pool" patch (which I've mentioned in the memory limit
thread as an alternative way to introduce backend-level accounting/limits)
could serve as a backend-level malloc cache, and how that would work. So I
cleaned up the PoC patch I already had (see 0003), and gave it a try.

And with both patches applied, the results for the partitioned join with
100 partitions look like this:

-M simple

                     1        8       16       32       64       96      160      224
--------------------------------------------------------------------------------------
master             653     3957     7120    12022    12707    11813    10349     9633
both patches       954     7356    14580    28259    51552    65278    70607    69598
speedup            1.5      1.9      2.0      2.4      4.1      5.5      6.8      7.2

-M prepared

                     1        8       16       32       64       96      160      224
--------------------------------------------------------------------------------------
master            1639     8273    14138    14746    13446    14001    11129    10136
both patches      4792    30102    62208   122157   220984   267763   315632   323567
speedup            2.9      3.6      4.4      8.3     16.4     19.1     28.4     31.9

That's pretty nice, I think. And I've seen many such improvements - it's
not a cherry-picked example. For the star join, the improvements are
very similar.

I'm attaching PDF files with tables visualizing the results for these two
benchmarks - there are results for different numbers of partitions/scales,
and different builds (master, one or both of the patches). There's also
a comparison to master, with a color scale "red = slower, green = faster"
(but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change
is much smaller. It's as if the two bottlenecks (locking and malloc) are
in balance - if you only address one of them, you don't get much. But if
you address both, it flies.

FWIW, where does the malloc overhead come from? For one, while we do have
some caching of malloc'ed memory in memory contexts, that doesn't quite
work across queries, because we destroy the contexts at the end of each
query. We try to cache the memory contexts themselves too, but in this
case that can't help, because the allocations come from btbeginscan(),
where we do this:

so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.
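
To illustrate (a simplified sketch with made-up names, not the actual aset.c
code): allocations above the context's chunk limit get a dedicated block,
which goes straight back to libc when it's freed, so there's nothing for the
context to reuse:

/* simplified sketch, not the real allocator */
void *
allocset_alloc_sketch(AllocSetSketch *set, size_t size)
{
    if (size > set->chunk_limit)      /* the ~27kB BTScanOpaqueData lands here */
        return malloc(size);          /* dedicated block: malloc() on every
                                       * btbeginscan(), free() on btendscan() */

    return freelist_alloc(set, size); /* small chunks come from reusable blocks */
}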

I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff
to improve, but caching malloc'ed blocks can clearly be very beneficial.
The basic idea is to have a cache that is "adaptive" (i.e. it adjusts to
caching blocks of the sizes the workload needs) but also cheap. The
patch is PoC/WIP and needs more work, but I think it works quite well.
If anyone wants to take a look or have a chat at FOSDEM, for example,
I'm available.
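
To give an idea of the mechanism, here's a very rough sketch of a
backend-level malloc cache with fixed power-of-two size classes (names are
made up, and the actual 0003 patch is smarter - it adaptively rebalances how
much it caches per block size):

#include <stdlib.h>

#define POOL_MIN_SHIFT   10           /* smallest cached block: 1kB */
#define POOL_CLASSES     10           /* 1kB, 2kB, ... up to 512kB */
#define POOL_MAX_CACHED  16           /* free blocks kept per size class */

typedef struct MemPoolClass
{
    int         nfree;                      /* number of cached free blocks */
    void       *blocks[POOL_MAX_CACHED];    /* blocks kept around for reuse */
} MemPoolClass;

static MemPoolClass pool[POOL_CLASSES];

/* smallest power-of-two size class that fits the request */
static int
pool_class(size_t size)
{
    int         idx = 0;

    while (idx < POOL_CLASSES && ((size_t) 1 << (POOL_MIN_SHIFT + idx)) < size)
        idx++;
    return idx;
}

static void *
pool_alloc(size_t size)
{
    int         idx = pool_class(size);

    if (idx >= POOL_CLASSES)
        return malloc(size);                     /* too large to cache */
    if (pool[idx].nfree > 0)
        return pool[idx].blocks[--pool[idx].nfree];          /* cache hit */
    return malloc((size_t) 1 << (POOL_MIN_SHIFT + idx));     /* cache miss */
}

/* the caller passes the same size it requested (memory contexts know that) */
static void
pool_free(void *ptr, size_t size)
{
    int         idx = pool_class(size);

    if (idx < POOL_CLASSES && pool[idx].nfree < POOL_MAX_CACHED)
        pool[idx].blocks[pool[idx].nfree++] = ptr;           /* keep for reuse */
    else
        free(ptr);
}

With something like this between the memory contexts and libc, the repeated
~27kB btbeginscan() allocation becomes a cache hit after the first query,
instead of a malloc()/free() pair every time.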

FWIW I was wondering whether this is a glibc-specific malloc bottleneck, so
I tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice - it confirms the bottleneck is glibc-specific (I wonder if
there's a way to tweak glibc to address this), and it also means systems
using jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
shows the mempool has a ~20% benefit on top of jemalloc.

FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can
easily get into a situation like this:

 54.18%  0.48%  postgres  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
     |
     --53.70%--entry_SYSCALL_64_after_hwframe
          |
          --53.03%--do_syscall_64
               |
               |--28.29%--__x64_sys_openat
               |    |
               |    --28.14%--do_sys_openat2
               |         |
               |         |--23.14%--do_filp_open
               |         |    |
               |         |    --22.72%--path_openat

That's pretty bad - it means we're opening/closing file descriptors like
crazy, because every query needs all the files. If I increase the number
of file descriptors (both the ulimit and max_files_per_process) to prevent
this thrashing, the throughput improves ~5x. Of course, this is not a
bottleneck that we can "fix" in code, it's simply a consequence of not
having enough file descriptors etc. But I wonder if we might make it
easier to monitor, e.g. by tracking the fd cache hit ratio, or something
like that ...

There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...) at

https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC machine,
but also smaller Intel machines (4C and 16C), and even two rpi5 machines
(yes, it helps quite a bit even on the rpi5).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment                                                       Content-Type      Size
join - epyc _ builds.pdf                                         application/pdf   85.0 KB
star - epyc _ builds.pdf                                         application/pdf   65.6 KB
v240118-0001-Increase-NUM_LOCK_PARTITIONS-to-64.patch            text/x-patch      1.3 KB
v240118-0002-Increase-the-number-of-fastpath-locks.patch         text/x-patch      12.8 KB
v240118-0003-Add-a-memory-pool-with-adaptive-rebalancing.patch   text/x-patch      37.5 KB
