Reduce function call costs on ELF platforms

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Reduce function call costs on ELF platforms
Date: 2021-11-22 21:50:48
Message-ID: 20211122215048.2ryxchocmtbmnwmp@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

There are a few related, but somewhat different, aspects to $subject.

TL;DR: We should use -fvisibility=hidden + explicit symbol visibility,
-Wl,-Bsymbolic, -fno-plt

1) Cross-translation-unit calls in extension library

A while ago I was looking at a profile of a workload that spent a good chunk
of time in an extension. Looking at the instruction level profile it showed
that some of that time was spent doing more-complicated-than-necessary
function calls to other functions within the extension.

Basically, the way we currently build our extensions, the compiler & linker
assume every symbol inside the extension libraries needs to be interceptable
by the main binary. Which means that all function calls to symbols visible
outside the current translation unit need to be made indirectly via the PLT.

An example of this (picked from plpgsql, for simplicity)

0000000000024a40 <plpgsql_inline_handler>:
{
...
func = plpgsql_compile_inline(codeblock->source_text);
24a80: 48 8b 85 a8 fe ff ff mov -0x158(%rbp),%rax
24a87: 48 8b 78 08 mov 0x8(%rax),%rdi
24a8b: e8 20 41 fe ff call 8bb0 <plpgsql_compile_inline@plt>
...

0000000000008bb0 <plpgsql_compile_inline@plt>:
8bb0: ff 25 da ac 02 00 jmp *0x2acda(%rip) # 33890 <plpgsql_compile_inline@@Base+0x24de0>
8bb6: 68 12 01 00 00 push $0x112
8bbb: e9 c0 ee ff ff jmp 7a80 <_init+0x18>

I.e. plpgsql_inline_handler doesn't call plpgsql_compile_inline() directly, it
calls plpgsql_compile_inline@plt(), which then loads the target address for
plpgsql_compile_inline() from the global offset table. Depending on the linker
settings / flags passed to dlopen() that'll point to yet another wrapper
function (doing a dynamic symbol lookup on the first call, putting the
real address in the GOT).

This can be addressed to some degree by using explicit symbol visibility
markers, as I propose in [1].

With that patch applied compiler / linker know that plpgsql_compile_inline()
is not an external symbol, and therefore doesn't need to go through the
PLT/GOT. That changes the above to:

func = plpgsql_compile_inline(codeblock->source_text);
23000: 48 8b 85 a8 fe ff ff mov -0x158(%rbp),%rax
23007: 48 8b 78 08 mov 0x8(%rax),%rdi
2300b: e8 00 a1 fe ff call d110 <plpgsql_compile_inline>

which unsurprisingly is cheaper.
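
To make the idea concrete, here is a minimal sketch of explicit visibility
markers. The macro EXT_EXPORT and both function names are made up for
illustration and are not taken from the patch in [1]; the sketch assumes GCC
or Clang and a build along the lines of
"cc -O2 -fPIC -fvisibility=hidden -shared".

```c
/*
 * Hypothetical export marker (an assumption for this sketch, not the
 * macro used by the patch in [1]).
 */
#if defined(__GNUC__) || defined(__clang__)
#define EXT_EXPORT __attribute__((visibility("default")))
#else
#define EXT_EXPORT
#endif

/*
 * Not marked: when built with -fvisibility=hidden this symbol is hidden,
 * so callers inside the same library do not need the PLT to reach it.
 */
int
ext_compile_inline(int source_len)
{
	return source_len * 2;
}

/*
 * Explicitly exported: stays visible to the main binary and to other
 * libraries even with -fvisibility=hidden.
 */
EXT_EXPORT int
ext_inline_handler(int source_len)
{
	/* Cross-translation-unit style call to a hidden symbol. */
	return ext_compile_inline(source_len) + 1;
}
```

Comparing "objdump -d" output of the resulting library with and without
-fvisibility=hidden should show whether the call to ext_compile_inline() still
goes through a PLT stub on a given toolchain.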

2) Calls to exported functions in extension library

However, this does *not* address the issue fully. When an extension calls a
function that has to be exported, the call will continue to go through
the PLT.

E.g. hstorePairs() has to be exported, because it's called from transform
modules. That results in calls to hstorePairs() from within hstore.so going
through the PLT, e.g.:

000000000000e380 <hstore_subscript_assign>:
{
...
e427: e8 e4 59 ff ff call 3e10 <hstorePairs@plt>

In theory we could mark such symbols as "protected" while compiling hstore.so
and as "default" otherwise, but that's pretty complicated. And there are some
toolchain issues with protected visibility.

The easier approach for this class of issues is to use the linker option
-Bsymbolic. That turns the above into a plain function call

000000000000e250 <hstore_subscript_assign>:
{
...
e2f7: e8 f4 a2 ff ff call 85f0 <hstorePairs>

As it turns out we already use -Bsymbolic on some platforms (solaris,
hpux). But not elsewhere.
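
As a rough sketch of the situation -Bsymbolic targets: a library in which one
exported function calls another exported function. The names pairs_count() and
pairs_assign() are hypothetical stand-ins (hstorePairs() plays the role of the
callee above), and the build commands in the comment are assumptions for a
GCC or Clang toolchain with GNU ld.

```c
#include <stddef.h>

/*
 * Assumed build commands, for comparison with objdump -d:
 *   cc -O2 -fPIC -c pairs.c
 *   cc -shared -o pairs.so pairs.o
 *   cc -shared -Wl,-Bsymbolic -o pairs_symbolic.so pairs.o
 */

/* Has to stay exported because other libraries call it. */
size_t
pairs_count(const int *keys, size_t nkeys)
{
	size_t		count = 0;

	for (size_t i = 0; i < nkeys; i++)
	{
		if (keys[i] != 0)
			count++;
	}
	return count;
}

/*
 * Calls the exported function from inside the same library. Without
 * -Bsymbolic the symbol is interposable, so this call may be routed
 * through the PLT; with -Bsymbolic the linker binds the reference to
 * the definition in this library.
 */
size_t
pairs_assign(const int *keys, size_t nkeys)
{
	return pairs_count(keys, nkeys) + 1;
}
```

The tradeoff is that with -Bsymbolic another library can no longer intercept
the library's own calls to pairs_count().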

3) Function calls from extension library to main binary
4) C library function calls

However, even with the above done, calls into shared libraries still
go through the PLT. This is particularly annoying for functions like palloc()
that are quite performance sensitive and where there's no potential use of
intercepting the function call with a different shared library.

E.g. the optimized disassembly of add_dummy_return() looks like

000000000000bc30 <add_dummy_return>:
{
...
new = palloc0(sizeof(PLpgSQL_stmt_block));
bc4d: bf 38 00 00 00 mov $0x38,%edi
bc52: e8 d9 a7 ff ff call 6430 <palloc0@plt>
...
0000000000006430 <palloc0@plt>:
6430: ff 25 d2 bb 02 00 jmp *0x2bbd2(%rip) # 32008 <palloc0>
6436: 68 01 00 00 00 push $0x1
643b: e9 d0 ff ff ff jmp 6410 <_init+0x20>

Obviously we cannot easily avoid indirection entirely in this case. The offset
to call palloc0 is not known when plpgsql.so is built. But we don't actually
need a two-level indirection.

By compiling with -fno-plt, the above becomes:

000000000000b130 <add_dummy_return>:
{
...
new = palloc0(sizeof(PLpgSQL_stmt_block));
b14d: bf 38 00 00 00 mov $0x38,%edi
b152: ff 15 80 66 02 00 call *0x26680(%rip) # 317d8 <palloc0>

I.e. a single level of indirection. This has more benefits than just removing
one layer of indirection. Here's what gcc's man page says:

-fno-plt
Do not use the PLT for external function calls in position-independent code. Instead, load the callee address
at call sites from the GOT and branch to it. This leads to more efficient code by eliminating PLT stubs and
exposing GOT loads to optimizations.

In some cases this allows functions to use the sibling-call optimization where
that previously was not possible (i.e. for x86 use "jmp" instead of "call" to
call another function when that function call is the last thing done in a
function, thereby reusing the call frame and reducing the cost of returns).
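
A small sketch of the kind of function where this can matter. The function
name token_length() is hypothetical; the point is only that the external call
is the last thing the function does. Whether the compiler actually emits an
indirect "jmp" through the GOT here depends on the target, the compiler
version and the flags (e.g. -O2 -fPIC -fno-plt), so the generated code needs
to be checked with objdump.

```c
#include <stddef.h>
#include <string.h>

/*
 * The call to strlen() is in tail position: nothing happens after it
 * returns, so the compiler may turn it into a jump (sibling call)
 * instead of a call followed by a return.
 */
size_t
token_length(const char *scanbuf, size_t offset)
{
	return strlen(scanbuf + offset);
}
```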

This doesn't just matter for extension libraries. It's also relevant for the
main binary (i.e. the upsides are bigger / more widely applicable) - every
function call to libc goes through PLT+GOT (well, with a dynamically linked
libc). This includes things that are often called in performance critical
bits, like strlen. E.g. without -fno-plt raw_parser() calls strlen via the
PLT:

cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
2775a6: 49 63 55 00 movslq 0x0(%r13),%rdx
2775aa: 4c 8b 3b mov (%rbx),%r15
2775ad: 48 89 4d c0 mov %rcx,-0x40(%rbp)
2775b1: 49 8d 3c 17 lea (%r15,%rdx,1),%rdi
2775b5: 48 89 55 c8 mov %rdx,-0x38(%rbp)
2775b9: e8 82 03 e5 ff call c7940 <strlen@plt>

but not with -fno-plt:
cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
2838e6: 49 63 55 00 movslq 0x0(%r13),%rdx
2838ea: 4c 8b 3b mov (%rbx),%r15
2838ed: 48 89 4d c0 mov %rcx,-0x40(%rbp)
2838f1: 49 8d 3c 17 lea (%r15,%rdx,1),%rdi
2838f5: 48 89 55 c8 mov %rdx,-0x38(%rbp)
2838f9: ff 15 09 45 66 00 call *0x664509(%rip) # 8e7e08 <strlen@GLIBC_2.2.5>

I haven't run detailed benchmarks in isolation, but have seen some good
results. It obviously is heavily workload dependent.

Greetings,

Andres Freund

[1] https://postgr.es/m/20211101020311.av6hphdl6xbjbuif%40alap3.anarazel.de
