Re: remap the .text segment into huge pages at run time

From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: remap the .text segment into huge pages at run time
Date: 2022-11-06 18:16:14
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 2022-11-06 13:56:10 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I don't think the dummy functions are a good approach, there were plenty
> > things after it when I played with them.
> To be technical, the point wasn't to have no code after it, but to have no
> *hot* code *before* it, since with the iodlr approach the first 1.99MB of
> .text is below the first aligned boundary within that section. But yeah,
> I'm happy to ditch that hack entirely.

Just because code is colder than the alternative branch, doesn't necessary
mean it's entirely cold overall. I saw hits to things after the dummy function
to have a perf effect.

> > > > With these flags the "R E" segments all start on a 0x200000/2MiB
> boundary
> > > and
> > > > are padded to the next 2MiB boundary. However the OS / dynamic loader
> only
> > > > maps the necessary part, not all the zero padding.
> > > >
> > > > This means that if we were to issue a MADV_COLLAPSE, we can before it
> do
> > > an
> > > > mremap() to increase the length of the mapping.
> > >
> > > I see, interesting. What location are you passing for madvise() and
> > > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > > aligned boundary within .text?
> > /*
> > * Make huge pages out of it. Requires at least linux 6.1. We
> could
> > * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all
> that
> > * much in older kernels.
> > */
> About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
> THP? The man page seems to indicate that.

MADV_HUGEPAGE works as long as /sys/kernel/mm/transparent_hugepage/enabled is
to always or madvise. My understanding is that MADV_COLLAPSE will work even
if /sys/kernel/mm/transparent_hugepage/enabled is set to never.

> In the support work I've done, the standard recommendation is to turn THP
> off, especially if they report sudden performance problems.

I think that's pretty much an outdated suggestion FWIW. Largely caused by Red
Hat extremely aggressively backpatching transparent hugepages into RHEL 6
(IIRC). Lots of improvements have been made to THP since then. I've tried to
see negative effects maybe 2-3 years back, without success.

I really don't see a reason to ever set
/sys/kernel/mm/transparent_hugepage/enabled to 'never', rather than just 'madvise'.

> If explicit HP's are used for shared mem, maybe THP is less of a risk? I
> need to look back at the tests that led to that advice...

I wouldn't give that advice to customers anymore, unless they use extremely
old platforms or unless there's very concrete evidence.

> > A real version would have to open /proc/self/maps and do this for at least
> I can try and generalize your above sketch into a v2 patch.


> > postgres' r-xp mapping. We could do it for libraries too, if they're
> suitably
> > aligned (both in memory and on-disk).
> It looks like plpgsql is only 27 standard pages in size...
> Regarding glibc, we could try moving a couple of the hotter functions into
> PG, using smaller and simpler coding, if that has better frontend cache
> behavior. The paper "Understanding and Mitigating Front-End Stalls in
> Warehouse-Scale Computers" talks about this, particularly section 4.4
> regarding memcmp().

I think the amount of work necessary for that is nontrivial and continual. So
I'm loathe to go there.

> > > I quickly tried to align the segments with the linker and then in my
> patch
> > > have the address for mmap() rounded *down* from the .text start to the
> > > beginning of that segment. It refused to start without logging an error.
> >
> > Hm, what linker was that? I did note that you need some additional flags
> for
> > some of the linkers.
> BFD, but I wouldn't worry about that failure too much, since the
> mremap()/madvise() strategy has a lot fewer moving parts.
> On the subject of linkers, though, one thing that tripped me up was trying
> to change the linker with Meson. First I tried
> -Dc_args='-fuse-ld=lld'

It's -Dc_link_args=...

> but that led to warnings like this when :
> /usr/bin/ld: warning: -z separate-loadable-segments ignored
> When using this in the top level
> elif host_system == 'linux'
> sema_kind = 'unnamed_posix'
> cppflags += '-D_GNU_SOURCE'
> # Align the loadable segments to 2MB boundaries to support remapping to
> # huge pages.
> ldflags += cc.get_supported_link_arguments([
> '-Wl,-zmax-page-size=0x200000',
> '-Wl,-zcommon-page-size=0x200000',
> '-Wl,-zseparate-loadable-segments'
> ])
> According to
> I need to add CC_LD=lld to the env vars before invoking, which got rid of
> the warning. Then I wanted to verify that lld was actually used, and in

You can just look at, fwiw. Or use ninja -v (in postgres's cases
with -d keeprsp, because the commandline ends up being long enough for a
response file being used).

> it says I can run this and it should show “Linker: LLD”, but that doesn't
> appear for me:
> $ readelf --string-dump .comment inst-perf/bin/postgres
> String dump of section '.comment':
> [ 0] GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

That's added by the compiler, not the linker. See e.g.:

$ readelf --string-dump .comment src/backend/postgres_lib.a.p/storage_ipc_procarray.c.o

String dump of section '.comment':
[ 1] GCC: (Debian 12.2.0-9) 12.2.0


Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2022-11-06 18:50:24 Re: Allow single table VACUUM in transaction block
Previous Message Tom Lane 2022-11-06 17:54:17 Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands