From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: slab allocator performance issues
Date: 2021-07-17 19:43:33
Message-ID: 20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

I just tried to use the slab allocator for a case where aset.c was
bloating memory usage substantially. First: It worked wonders for memory
usage, nearly eliminating overhead.

But it turned out to cause a *substantial* slowdown. With aset the
allocator is barely in the profile. With slab the profile is dominated
by allocator performance.

slab:
NOTICE: 00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880
Overhead Command Shared Object Symbol
+ 28.27% postgres postgres [.] SlabAlloc
+ 9.64% postgres bdbench.so [.] bfm_delete
+ 9.03% postgres bdbench.so [.] bfm_set
+ 8.39% postgres bdbench.so [.] bfm_lookup
+ 8.36% postgres bdbench.so [.] bfm_set_leaf.constprop.0
+ 8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
+ 6.88% postgres bdbench.so [.] bfm_delete_leaf
+ 3.24% postgres libc-2.31.so [.] _int_malloc
+ 2.58% postgres bdbench.so [.] bfm_tests
+ 2.33% postgres postgres [.] SlabFree
+ 1.29% postgres libc-2.31.so [.] _int_free
+ 1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0

aset:

NOTICE: 00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880

+ 16.43% postgres bdbench.so [.] bfm_lookup
+ 15.38% postgres bdbench.so [.] bfm_delete
+ 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
+ 12.65% postgres bdbench.so [.] bfm_set
+ 12.15% postgres bdbench.so [.] bfm_set_leaf.constprop.0
+ 10.57% postgres bdbench.so [.] bfm_delete_leaf
+ 4.05% postgres bdbench.so [.] bfm_tests
+ 2.93% postgres [kernel.vmlinux] [k] clear_page_erms
+ 1.59% postgres postgres [.] AllocSetAlloc
+ 1.15% postgres bdbench.so [.] memmove(at)plt
+ 1.06% postgres bdbench.so [.] bfm_grow_leaf_16

OS:
NOTICE: 00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
LOCATION: bfm_test_insert_bulk, radix.c:2880

That is somewhat surprising - part of the promise of a slab allocator is
that it's fast...

This is caused by multiple issues, I think, some of which seem fairly easy to
fix.

1) If allocations are short-lived, slab.c can end up constantly
freeing/initializing blocks. Initializing a block is fairly expensive,
requiring iteration over all potential chunks in the block - just for that
memory to be freed again after a small number of allocations. The extreme
case of this is alternating phases of alloc/free of a single allocation.

I "fixed" this by adding a few && slab->nblocks > 1 in SlabFree() and the
problem vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of course that
only works if the problem is with the only, so it's not a great
approach. Perhaps just keeping the last allocated block around would work?
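
Something like the below is what I'm thinking of - just a sketch, the
emptyBlock field is hypothetical, slab.c has nothing like it today:

/* in SlabFree(), once the block has become entirely free: */
if (block->nfree == slab->chunksPerBlock)
{
    dlist_delete(&block->node);

    if (slab->emptyBlock == NULL)
        slab->emptyBlock = block;   /* keep one empty block in reserve */
    else
    {
        free(block);
        slab->nblocks--;
    }
}

/* in SlabAlloc(), before malloc()ing a fresh block: */
if (slab->emptyBlock != NULL)
{
    block = slab->emptyBlock;
    slab->emptyBlock = NULL;        /* chunk freelist inside is still intact */
}
else
    block = (SlabBlock *) malloc(slab->blockSize);

That way repeated alloc/free cycles on an otherwise-empty context wouldn't
pay the block initialization cost over and over.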

2) SlabChunkIndex() in SlabFree() is slow. It requires a 64-bit division,
taking up ~50% of the cycles in SlabFree(). A 64-bit div, according to [1],
has a latency of 35-88 cycles on Skylake-X (and a reciprocal throughput of
21-83, i.e. no parallelism). While it's getting a bit faster on Ice Lake /
Zen 3, it's still slow enough there to be very worrisome.

I don't see a way to get around the division while keeping the freelist
structure as is. But:

ISTM that we only need the index because of the free-chunk list, right? Why
don't we make the chunk list use actual pointers? Is the concern that that'd
increase the minimum allocation size? If so, I see two ways around that:
First, we could make the index just the offset from the start of the block,
which is much cheaper to calculate. Second, we could store the next pointer
in SlabChunk->slab/block instead (making it a union); while a chunk is on the
freelist we don't need to dereference those, right?

I suspect both would also make the block initialization a bit cheaper.
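
To illustrate the union idea - a sketch only, firstfree/nextfree are made-up
names and the structures are simplified from slab.c:

typedef struct SlabChunk
{
    union
    {
        SlabBlock  *block;              /* valid while allocated */
        struct SlabChunk *nextfree;     /* valid while on the freelist */
    };
    SlabContext *slab;                  /* backpointer, always valid */
} SlabChunk;

/* freeing becomes two pointer stores, no division: */
chunk->nextfree = block->firstfree;
block->firstfree = chunk;
block->nfree++;

/* allocation just pops the list: */
chunk = block->firstfree;
block->firstfree = chunk->nextfree;
chunk->block = block;                   /* restore before handing out */
block->nfree--;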

That should also accelerate SlabBlockGetChunk(), which currently shows up as
an imul - not exactly fast either (and it uses a lot of execution ports).
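
For comparison, the offset variant could look like this, reusing slab.c's
SlabBlockStart() and fullChunkSize:

/* today SlabChunkIndex() computes, with a 64-bit div:
 *   ((char *) chunk - SlabBlockStart(block)) / slab->fullChunkSize
 * storing the byte offset instead needs only a subtraction: */
static inline uint32
SlabChunkOffset(SlabBlock *block, SlabChunk *chunk)
{
    return (uint32) ((char *) chunk - SlabBlockStart(block));
}

/* and the reverse direction is an add rather than an imul: */
static inline SlabChunk *
SlabBlockGetChunkByOffset(SlabBlock *block, uint32 offset)
{
    return (SlabChunk *) (SlabBlockStart(block) + offset);
}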

3) Every free/alloc needs to unlink a block from slab->freelist[i] and then
relink it, which shows up prominently in profiles. That roughly triples the
number of cachelines touched in the happy path, with unpredictable accesses
to boot.

Perhaps we could reduce the precision of the slab->freelist indexing to
amortize that cost? I.e. instead of one slab->freelist entry for each
possible nfree value, impose an upper limit on the number of freelists?
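
Roughly like this - SLAB_FREELIST_COUNT is a hypothetical constant, today
slab.c sizes the freelist array as chunksPerBlock + 1:

#define SLAB_FREELIST_COUNT 8

static inline int
SlabFreelistIndex(SlabContext *slab, int nfree)
{
    /* map nfree in [0, chunksPerBlock] onto a fixed set of buckets;
     * the divisor is constant per context, so it could be
     * strength-reduced once at context creation */
    return (nfree * SLAB_FREELIST_COUNT) / (slab->chunksPerBlock + 1);
}

/* in SlabFree(): relink the block only when its bucket changes */
int oldidx = SlabFreelistIndex(slab, block->nfree);
int newidx = SlabFreelistIndex(slab, block->nfree + 1);

block->nfree++;
if (oldidx != newidx)
{
    dlist_delete(&block->node);
    dlist_push_head(&slab->freelist[newidx], &block->node);
}

That way most free/alloc pairs wouldn't touch the context-wide freelist at
all.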

4) Less of a performance issue, and more of a usability one: The constant
block size strikes me as problematic. Most users of an allocator sometimes
deal with a small amount of data and sometimes with a large amount, and a
single fixed block size can't fit both cases well.
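
One way out might be to let the block size grow geometrically, the way
aset.c grows its blocks. A sketch, with curBlockSize / maxBlockSize being
hypothetical fields:

/* in SlabAlloc(), when a new block is needed: */
block = (SlabBlock *) malloc(slab->curBlockSize);

/* double the next block's size, as aset.c does, up to a limit */
if (slab->curBlockSize < slab->maxBlockSize)
    slab->curBlockSize <<= 1;

That would of course mean chunksPerBlock varies per block, which would need
to be accounted for in the freelist indexing above.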

Greetings,

Andres Freund

[1] https://www.agner.org/optimize/instruction_tables.pdf
