Re: slab allocator performance issues

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: slab allocator performance issues
Date: 2021-07-17 20:35:07
Message-ID: c3bf99d3-0dec-828e-e3b5-ed2d2922bdd8@enterprisedb.com
Lists: pgsql-hackers

Hi,

On 7/17/21 9:43 PM, Andres Freund wrote:
> Hi,
>
> I just tried to use the slab allocator for a case where aset.c was
> bloating memory usage substantially. First: It worked wonders for memory
> usage, nearly eliminating overhead.
>
> But it turned out to cause a *substantial* slowdown. With aset the
> allocator is barely in the profile. With slab the profile is dominated
> by allocator performance.
>
> slab:
> NOTICE: 00000: 100000000 ordered insertions in 5.216287 seconds, 19170724/sec
> LOCATION: bfm_test_insert_bulk, radix.c:2880
> Overhead Command Shared Object Symbol
>
> + 28.27% postgres postgres [.] SlabAlloc
> + 9.64% postgres bdbench.so [.] bfm_delete
> + 9.03% postgres bdbench.so [.] bfm_set
> + 8.39% postgres bdbench.so [.] bfm_lookup
> + 8.36% postgres bdbench.so [.] bfm_set_leaf.constprop.0
> + 8.16% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
> + 6.88% postgres bdbench.so [.] bfm_delete_leaf
> + 3.24% postgres libc-2.31.so [.] _int_malloc
> + 2.58% postgres bdbench.so [.] bfm_tests
> + 2.33% postgres postgres [.] SlabFree
> + 1.29% postgres libc-2.31.so [.] _int_free
> + 1.09% postgres libc-2.31.so [.] unlink_chunk.constprop.0
>
> aset:
>
> NOTICE: 00000: 100000000 ordered insertions in 2.082602 seconds, 48016848/sec
> LOCATION: bfm_test_insert_bulk, radix.c:2880
>
> + 16.43% postgres bdbench.so [.] bfm_lookup
> + 15.38% postgres bdbench.so [.] bfm_delete
> + 12.82% postgres libc-2.31.so [.] __memmove_avx_unaligned_erms
> + 12.65% postgres bdbench.so [.] bfm_set
> + 12.15% postgres bdbench.so [.] bfm_set_leaf.constprop.0
> + 10.57% postgres bdbench.so [.] bfm_delete_leaf
> + 4.05% postgres bdbench.so [.] bfm_tests
> + 2.93% postgres [kernel.vmlinux] [k] clear_page_erms
> + 1.59% postgres postgres [.] AllocSetAlloc
> + 1.15% postgres bdbench.so [.] memmove(at)plt
> + 1.06% postgres bdbench.so [.] bfm_grow_leaf_16
>
> OS:
> NOTICE: 00000: 100000000 ordered insertions in 2.089790 seconds, 47851690/sec
> LOCATION: bfm_test_insert_bulk, radix.c:2880
>
>
> That is somewhat surprising - part of the promise of a slab allocator is
> that it's fast...
>
>
> This is caused by multiple issues, I think. Some of which seems fairly easy to
> fix.
>
> 1) If allocations are short-lived, slab.c can end up constantly
> freeing/initializing blocks, which requires fairly expensive iteration over
> all potential chunks in the block to initialize them. Just to then free that
> memory again after a small number of allocations. The extreme case of this is
> when there are phases of alloc/free of a single allocation.
>
> I "fixed" this by adding a few && slab->nblocks > 1 checks in SlabFree(), and
> the problem is vastly reduced. Instead of a 0.4x slowdown it's 0.88x. Of
> course that only works when the churn is confined to a single block, so it's
> not a great approach. Perhaps just keeping the last allocated block around
> would work?
>

+1

I think it makes perfect sense to not free the blocks immediately, and
keep one (or a small number) as a cache. I'm not sure why we decided not
to have a "keeper" block, but I suspect memory consumption was my main
concern at that point. But I never expected the cost to be this high.
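The keeper-block idea could look something like the following minimal sketch. This is my own illustration, not the actual slab.c code; the `Slab`/`Block` structs and function names are made up, and the real implementation would have to initialize the chunk freelist on reuse:

```c
#include <stdlib.h>

/* Illustrative sketch of caching one empty block instead of freeing it,
 * so an alloc/free ping-pong of a single chunk doesn't repeatedly pay
 * the block release/initialization cost.  Names are hypothetical. */
typedef struct Block { struct Block *next; } Block;

typedef struct Slab
{
    Block  *keeper;         /* one cached empty block, NULL if none */
    size_t  blockSize;
    int     nblocks;
} Slab;

static void
slab_release_block(Slab *slab, Block *block)
{
    if (slab->keeper == NULL)
        slab->keeper = block;       /* cache instead of free() */
    else
        free(block);
    slab->nblocks--;
}

static Block *
slab_get_block(Slab *slab)
{
    Block *block;

    if (slab->keeper != NULL)
    {
        block = slab->keeper;       /* reuse cached block, skip the init */
        slab->keeper = NULL;
    }
    else
        block = malloc(slab->blockSize);
    slab->nblocks++;
    return block;
}
```

The memory cost is bounded (at most one idle block per context), which seems like a reasonable trade against the re-initialization overhead shown in the profile above.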

>
> 2) SlabChunkIndex() in SlabFree() is slow. It requires a 64bit division, taking
> up ~50% of the cycles in SlabFree(). A 64bit div, according to [1] , has a
> latency of 35-88 cycles on skylake-x (and a reverse throughput of 21-83,
> i.e. no parallelism). While it's getting a bit faster on icelake / zen 3, it's
> still slow enough there to be very worrisome.
>
> I don't see a way to get around the division while keeping the freelist
> structure as is. But:
>
> ISTM that we only need the index because of the free-chunk list, right? Why
> don't we make the chunk list use actual pointers? Is it a concern that that'd
> increase the minimum allocation size? If so, I see two ways around that:
> First, we could make the index just the offset from the start of the block,
> that's much cheaper to calculate. Second, we could store the next pointer in
> SlabChunk->slab/block instead (making it a union) - while on the freelist we
> don't need to dereference those, right?
>
> I suspect both would also make the block initialization a bit cheaper.
>
> That should also accelerate SlabBlockGetChunk(), which currently shows up as
> an imul, which isn't exactly fast either (and uses a lot of execution ports).
>

Hmm, I think you're right that we could simply use pointers, but I have
not tried that.
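An intrusive free list along those lines might look like this sketch (again my own illustration, with made-up names, not slab.c's structures). The point is that free and alloc become pure pointer pushes/pops, with no division or multiplication to translate between chunk addresses and indexes:

```c
#include <stddef.h>

/* Hypothetical intrusive free-chunk list: while a chunk is free, its
 * first pointer-sized bytes hold the "next free" link, so no
 * (chunk - block_start) / chunk_size division is needed on free. */
typedef struct FreeChunk
{
    struct FreeChunk *next;     /* valid only while the chunk is free */
} FreeChunk;

typedef struct Block
{
    FreeChunk *freehead;        /* head of the intrusive free list */
} Block;

static void
block_free_chunk(Block *block, void *chunk)
{
    FreeChunk *fc = (FreeChunk *) chunk;

    fc->next = block->freehead; /* push: O(1), pointer-only */
    block->freehead = fc;
}

static void *
block_alloc_chunk(Block *block)
{
    FreeChunk *fc = block->freehead;

    if (fc == NULL)
        return NULL;            /* block exhausted */
    block->freehead = fc->next; /* pop */
    return fc;
}
```

The cost is that chunks must be at least pointer-sized, which is what the union-with-`SlabChunk->slab/block` idea quoted above would avoid.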

>
> 3) Every free/alloc needing to unlink from slab->freelist[i] and then relink
> shows up prominently in profiles. That's ~tripling the number of cachelines
> touched in the happy path, with unpredictable accesses to boot.
>
> Perhaps we could reduce the precision of slab->freelist indexing to amortize
> that cost? I.e. not have one slab->freelist entry for each nfree, but instead
> have an upper limit on the number of freelists?
>

Yeah. The purpose of organizing the freelists like this is to prioritize
the "more full" blocks when allocating new chunks, in the hope that the
"less full" blocks will end up empty and get freed sooner.

But this is naturally imprecise and strongly workload-dependent, of
course, and I bet for most cases a less precise approach would work just
as well.

I'm not sure how exactly the upper limit you propose would work, but
perhaps we could group the blocks into nfree ranges, say [0-15], [16-31]
and so on. After each alloc/free we'd calculate the new freelist index
as (nfree / 16) and only move the block if the index changed. That would
reduce the overhead to 1/16, and probably even more in practice.

Of course, we could also say we have e.g. 8 freelists and work the
ranges backwards from that; I guess that's what you propose.
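Either way, the index calculation is trivial. A sketch of the fixed-bucket-count variant (my illustration; `NFREELISTS` and the function name are made up): with `chunksPerBlock` possible nfree values but only `NFREELISTS` buckets, a block moves between freelists only when nfree crosses a bucket boundary, cutting the unlink/relink traffic roughly by a factor of chunksPerBlock / NFREELISTS:

```c
/* Hypothetical coarse freelist indexing: map nfree in
 * [0, chunksPerBlock] onto one of NFREELISTS buckets. */
#define NFREELISTS 8

static int
freelist_index(int nfree, int chunksPerBlock)
{
    /* +1 so that nfree == chunksPerBlock still lands in the top bucket */
    return (nfree * NFREELISTS) / (chunksPerBlock + 1);
}
```

Small alloc/free oscillations then stay within one bucket, so the happy path touches no freelist links at all.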

>
> 4) Less of a performance issue, and more of a usability one: the
> constant block size strikes me as problematic. Most uses of an
> allocator sometimes deal with a small amount of data, and sometimes
> with a large amount.
>

I doubt this is worth the effort, really. The constant block size makes
various places much simpler (both to code and to reason about), and it
should not make a huge difference in performance. And IMHO the block
size is mostly an implementation detail, so I don't see it as a
usability issue.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
