Re: slab allocator performance issues

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: slab allocator performance issues
Date: 2022-12-13 00:49:36
Message-ID: CAApHDvrnpKZrhJaz6TF0LM0Of85=eKAuE3x8STxHZ-fJBi1XMQ@mail.gmail.com

Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor <john(dot)naylor(at)enterprisedb(dot)com> wrote:
> v13-0001 to 0005:

> 2.60% postgres postgres [.] SlabFree

> + v4 slab:

> 4.98% postgres postgres [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.

I've tried to reproduce this with the v13 patches applied and I'm not
really getting the same results as you are. To run the function 100
times I used:

select x, a.*
from generate_series(1,100) x(x),
     lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to create a lateral dependency so that the
function isn't just executed once.)
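
(For reference, the profiles below are from attaching perf to the
backend while that query runs, roughly:

sudo perf record --pid=<backend pid> -- sleep 2
sudo perf report

give or take the exact options.)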

v13-0001 - 0005 gives me:

37.71% postgres [.] rt_set
19.24% postgres [.] SlabAlloc
8.73% [kernel] [k] clear_page_rep
5.21% postgres [.] rt_node_insert_inner.isra.0
2.63% [kernel] [k] asm_exc_page_fault
2.24% postgres [.] SlabFree

and fairly consistently 122 ms runtime per call.

Applying v4 slab patch I get:

41.06% postgres [.] rt_set
10.84% postgres [.] SlabAlloc
9.01% [kernel] [k] clear_page_rep
6.49% postgres [.] rt_node_insert_inner.isra.0
2.76% postgres [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you can consistently get the same result on another
compiler or after patching something like master~50 or master~100.
Maybe it's just a code alignment thing.

Looking at the annotation of perf report for SlabFree with the patched
version I see:


│ /* push this chunk onto the head of the free list */
│ *(MemoryChunk **) pointer = block->freehead;
0.09 │ mov 0x10(%r8),%rax
│ slab = block->slab;
59.15 │ mov (%r8),%rbp
│ *(MemoryChunk **) pointer = block->freehead;
9.43 │ mov %rax,(%rdi)
│ block->freehead = chunk;

│ block->nfree++;

I think what that's telling me is that dereferencing the block's memory
is slow, likely because that particular cache line is no longer in the
CPU cache. I tried running the test with 10,000 ints instead of 500,000
so that there would be less CPU cache pressure. I see:

29.76 │ mov (%r8),%rbp
│ *(MemoryChunk **) pointer = block->freehead;
12.72 │ mov %rax,(%rdi)
│ block->freehead = chunk;

│ block->nfree++;
│ mov 0x8(%r8),%eax
│ block->freehead = chunk;
4.27 │ mov %rdx,0x10(%r8)
│ SlabBlocklistIndex():
│ index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
│ mov $0x1,%edx
│ SlabFree():
│ block->nfree++;
│ lea 0x1(%rax),%edi
│ mov %edi,0x8(%r8)
│ SlabBlocklistIndex():
│ int32 blocklist_shift = slab->blocklist_shift;
│ mov 0x70(%rbp),%ecx
│ index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
8.46 │ shl %cl,%edx

Various other instructions in SlabFree are now taking proportionally
longer. For example, the bit shift at the end was insignificant
previously. That indicates to me that this is down to caching effects.
We must fetch the block in SlabFree() in both versions. It's possible
that something in SlabAlloc() is causing more useful cache lines to be
evicted, but (I think) one of the primary design goals Andres was going
for was to reduce exactly that. For example, not having to write out
the freelist for an entire block when the block is first allocated
means we no longer have to touch possibly every cache line of the
block.
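
To save anyone digging through the patch, those annotated lines amount
to roughly the following (a paraphrased sketch only; the struct layout
below is invented and the actual patch code differs):

#include <stdint.h>

/* Sketch only -- field names mirror the perf annotation above, but the
 * struct layout here is invented and the real patch code differs. */
typedef struct SlabContextSketch
{
    int32_t     blocklist_shift;
} SlabContextSketch;

typedef struct SlabBlockSketch
{
    SlabContextSketch *slab;    /* the 59% line above: the first touch
                                 * of the block's cache line */
    int32_t     nfree;
    void       *freehead;
} SlabBlockSketch;

static inline int32_t
blocklist_index_sketch(SlabContextSketch *slab, int32_t nfree)
{
    int32_t     blocklist_shift = slab->blocklist_shift;

    /* index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift */
    return (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
}

static inline int32_t
slab_free_sketch(SlabBlockSketch *block, void *pointer)
{
    /* push this chunk onto the head of the block's free list */
    *(void **) pointer = block->freehead;
    block->freehead = pointer;
    block->nfree++;

    /* work out which blocklist the block now belongs on */
    return blocklist_index_sketch(block->slab, block->nfree);
}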

I tried looking at perf stat during the run.

Without slab changes:

drowley(at)amd3990x:~$ sudo perf stat --pid=74922 sleep 2
Performance counter stats for process id '74922':

          2,000.74 msec task-clock                #    1.000 CPUs utilized
                 4      context-switches          #    1.999 /sec
                 0      cpu-migrations            #    0.000 /sec
           578,139      page-faults               #  288.963 K/sec
     8,614,687,392      cycles                    #    4.306 GHz                      (83.21%)
       682,574,688      stalled-cycles-frontend   #    7.92% frontend cycles idle     (83.33%)
     4,822,904,271      stalled-cycles-backend    #   55.98% backend cycles idle      (83.41%)
    11,447,124,105      instructions              #    1.33 insn per cycle
                                                  #    0.42 stalled cycles per insn   (83.41%)
     1,947,647,575      branches                  #  973.464 M/sec                    (83.41%)
        13,914,897      branch-misses             #    0.71% of all branches          (83.24%)

2.000924020 seconds time elapsed

With slab changes:

drowley(at)amd3990x:~$ sudo perf stat --pid=75967 sleep 2
Performance counter stats for process id '75967':

          2,000.89 msec task-clock                #    1.000 CPUs utilized
                 1      context-switches          #    0.500 /sec
                 0      cpu-migrations            #    0.000 /sec
           607,423      page-faults               #  303.576 K/sec
     8,566,091,176      cycles                    #    4.281 GHz                      (83.21%)
       737,839,390      stalled-cycles-frontend   #    8.61% frontend cycles idle     (83.32%)
     4,454,357,725      stalled-cycles-backend    #   52.00% backend cycles idle      (83.41%)
    10,760,559,837      instructions              #    1.26 insn per cycle
                                                  #    0.41 stalled cycles per insn   (83.41%)
     1,872,047,962      branches                  #  935.606 M/sec                    (83.41%)
        14,928,953      branch-misses             #    0.80% of all branches          (83.25%)

2.000960610 seconds time elapsed

It would be interesting to see if your perf stat output is showing
something significantly different with and without the slab changes.

It does not seem impossible that, because the slab changes mean
SlabAlloc() looks at less memory, SlabFree() now has to fetch cache
lines that in the unpatched version would already have been available.
If that is the case, then I think we shouldn't worry about it unless we
can find some workload that demonstrates an overall performance
regression with the patch. I just don't quite have enough perf
experience to know how I might go about proving that.
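
If it is caching effects, then maybe comparing cache miss counters with
and without the patch would be a starting point, e.g. something like
(event names vary per CPU/kernel, and <backend pid> needs filling in):

sudo perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses --pid=<backend pid> sleep 2

though I'm not sure how conclusive that would be on its own.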

David
