From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: remap the .text segment into huge pages at run time
Date: 2022-11-03 17:21:23
Message-ID: 20221103172123.wiagvldhcfqps2mv@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> It's been known for a while that Postgres spends a lot of time translating
> instruction addresses, and using huge pages in the text segment yields a
> substantial performance boost in OLTP workloads [1][2].

Indeed. Some of that we eventually should address by making our code less
"jumpy", but that's a large amount of work and only going to go so far.

> The difficulty is,
> this normally requires a lot of painstaking work (unless your OS does
> superpage promotion, like FreeBSD).

I'm still confused by FreeBSD being able to do this without the section
alignment being increased sufficiently. Or is the default alignment on FreeBSD
already large enough?

> I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> remap the .text segment to huge pages at program start. Attached is a
> hackish, Meson-only, "works on my machine" patchset to experiment with this
> idea.

I wonder how far we can get with just using the linker hints to align
sections. I know that the linux folks are working on promoting sufficiently
aligned executable pages to huge pages too, and might have succeeded already.

IOW, adding the linker flags might be a good first step.

> 0001 adapts the library to our error logging and GUC system. The overview:
>
> - read ELF info to get the start/end addresses of the .text segment
> - calculate addresses therein aligned at huge page boundaries
> - mmap a temporary region and memcpy the aligned portion of the .text
> segment
> - mmap aligned start address to a second region with huge pages and
> MAP_FIXED
> - memcpy over from the temp region and revoke the PROT_WRITE bit

Would mremap()'ing the temporary region also work? That might be simpler and
more robust (you'd see the MAP_HUGETLB failure before doing anything
irreversible). And you then might not even need this:

> The reason this doesn't "saw off the branch you're standing on" is that the
> remapping is done in a function that's forced to live in a different
> segment, and doesn't call any non-libc functions living elsewhere:
>
> static void
> __attribute__((__section__("lpstub")))
> __attribute__((__noinline__))
> MoveRegionToLargePages(const mem_range * r, int mmap_flags)

This would likely need a bunch more gating than the patch, understandably,
has. I think it'd fail horribly if there were .text relocations, for example?
I think there are some architectures that do that by default...

> 0002 is my attempt to force the linker's hand and get the entire text
> segment mapped to huge pages. It's quite a finicky hack, and easily broken
> (see below). That said, it still builds easily within our normal build
> process, and maybe there is a better way to get the effect.
>
> It does two things:
>
> - Pass the linker -Wl,-zcommon-page-size=2097152
> -Wl,-zmax-page-size=2097152 which aligns .init to a 2MB boundary. That's
> done for predictability, but that means the next 2MB boundary is very
> nearly 2MB away.

Yep. FWIW, my notes say

# align sections to 2MB boundaries for hugepage support
# bfd and gold linkers:
# -Wl,-zmax-page-size=0x200000 -Wl,-zcommon-page-size=0x200000
# lld:
# -Wl,-zmax-page-size=0x200000 -Wl,-z,separate-loadable-segments
# then copy binary to tmpfs mounted with -o huge=always

I.e. with lld you need a slightly different flag: -Wl,-z,separate-loadable-segments

The meson bit should probably just use
cc.get_supported_link_arguments([
'-Wl,-zmax-page-size=0x200000',
'-Wl,-zcommon-page-size=0x200000',
'-Wl,-zseparate-loadable-segments'])

Afaict there's really no reason not to do that by default, allowing kernels
that can promote to huge pages to do so.

My approach to forcing huge pages to be used was to then:

# copy binary to tmpfs mounted with -o huge=always
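For reference, spelled out as commands (paths and mount point are
illustrative, not from my actual setup; this needs root and tmpfs huge-page
support in the kernel):

```shell
# mount a tmpfs that hands out huge pages for mapped files
mount -t tmpfs -o huge=always tmpfs /mnt/hugetext

# copy the (section-aligned) binary there and run it from that path
cp /usr/local/pgsql/bin/postgres /mnt/hugetext/
/mnt/hugetext/postgres -D "$PGDATA"
```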

> - Add a "cold" __asm__ filler function that just takes up space, enough to
> push the end of the .text segment over the next aligned boundary, or to
> ~8MB in size.

I don't understand why this is needed - as long as the pages are aligned to
2MB, why do we need to fill things up on disk? The in-memory contents are the
relevant bit, no?

> Since the front is all-cold, and there is very little at the end,
> practically all hot pages are now remapped. The biggest problem with the
> hackish filler function (in addition to maintainability) is, if explicit
> huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> causes complete startup failure if the .text segment is larger than 8MB.

I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
independent of the .text segment size?

> +/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
> +static int
> +FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
> +{
> + ElfW(Shdr) text_section;
> + FindParams *find_params = (FindParams *) data;
> +
> + /*
> + * We are only interested in the mapping matching the main executable.
> + * This has the empty string for a name.
> + */
> + if (hdr->dlpi_name[0] != '\0')
> + return 0;
> +

It's not entirely clear we'd only ever want to do this for the main
executable. E.g. plpgsql could also benefit.

> diff --git a/meson.build b/meson.build
> index bfacbdc0af..450946370c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -239,6 +239,9 @@ elif host_system == 'freebsd'
> elif host_system == 'linux'
> sema_kind = 'unnamed_posix'
> cppflags += '-D_GNU_SOURCE'
> + # WIP: debug builds are huge
> + # TODO: add portability check
> + ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']

What's that WIP about?

> elif host_system == 'netbsd'
> # We must resolve all dynamic linking in the core server at program start.
> diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
> new file mode 100644
> index 0000000000..de4e33bb05
> --- /dev/null
> +++ b/src/backend/port/filler.c
> @@ -0,0 +1,29 @@
> +/*
> + * Add enough padding to .text segment to bring the end just
> + * past a 2MB alignment boundary. In practice, this means .text needs
> + * to be at least 8MB. It shouldn't be much larger than this,
> + * because then more hot pages will remain in 4kB pages.
> + *
> + * FIXME: With this filler added, if explicit huge pages are turned off
> + * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
> + * instead of reporting failure if the .text segment is larger than 8MB.
> + *
> + * See MapStaticCodeToLargePages() in large_page.c
> + *
> + * XXX: The exact amount of filler must be determined experimentally
> + * on platforms of interest, in non-assert builds.
> + *
> + */
> +static void
> +__attribute__((used))
> +__attribute__((cold))
> +fill_function(int x)
> +{
> + /* TODO: More architectures */
> +#ifdef __x86_64__
> +__asm__(
> + ".fill 3251000"
> +);
> +#endif
> + (void) x;
> +}
> \ No newline at end of file
> diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
> index 5ab65115e9..d876712e0c 100644
> --- a/src/backend/port/meson.build
> +++ b/src/backend/port/meson.build
> @@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
> endif
>
> if cdata.has('USE_SYSV_SHARED_MEMORY')
> + if host_system == 'linux'
> + backend_sources += files('filler.c')
> + endif
> backend_sources += files('large_page.c')
> backend_sources += files('sysv_shmem.c')
> endif
> --
> 2.37.3
>

Greetings,

Andres Freund
