Re: remap the .text segment into huge pages at run time

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: remap the .text segment into huge pages at run time
Date: 2022-11-06 06:56:10
Message-ID: CAFBsxsH7ryBmTzAo7Ot36G+2xZ=0MV6NnbVVgzs6m78wzetsCA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 5, 2022 at 3:27 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I
should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES= \
filler16.o \

OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no
*hot* code *before* it, since with the iodlr approach the first 1.99MB of
.text is below the first aligned boundary within that section. But yeah,
I'm happy to ditch that hack entirely.
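
To make the arithmetic concrete, here is a minimal sketch (not taken from the
iodlr code or from any patch in this thread) of how much of a section falls
outside the 2MiB-aligned range. The helper names are hypothetical and a 2MiB
huge page size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2MiB huge pages, as on x86-64. */
#define HUGE_PAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/* Round addr up to the next huge-page boundary (no-op if already aligned). */
static uintptr_t
huge_align_up(uintptr_t addr)
{
	return (addr + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
}

/* Round addr down to the previous huge-page boundary. */
static uintptr_t
huge_align_down(uintptr_t addr)
{
	return addr & ~(HUGE_PAGE_SIZE - 1);
}

/*
 * Number of bytes in [start, end) that lie below the first aligned boundary
 * or above the last one, and so cannot be covered by whole huge pages.
 */
static size_t
unremappable_bytes(uintptr_t start, uintptr_t end)
{
	uintptr_t	lo = huge_align_up(start);
	uintptr_t	hi = huge_align_down(end);

	if (lo >= hi)
		return end - start;		/* no full huge page fits at all */
	return (lo - start) + (end - hi);
}
```

With a section starting one 4kB page past a boundary, for example at
0x401000, the bytes below the first aligned boundary come to 0x1ff000, which
is where a figure just under 2MB comes from.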

> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > > maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > > mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?

> /*
>  * Make huge pages out of it. Requires at least linux 6.1. We could
>  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>  * much in older kernels.
>  */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
THP? The man page seems to indicate that.

In the support work I've done, the standard recommendation is to turn THP
off, especially if users report sudden performance problems. If explicit
huge pages are used for shared memory, maybe THP is less of a risk? I need
to look back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...
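
As a starting point for the /proc/self/maps part, a rough sketch (with
hypothetical names, assuming 2MiB huge pages and the usual
"start-end perms offset dev inode path" line format) of finding r-xp
mappings and checking whether they are aligned both in memory and on disk:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 2MiB huge pages, as on x86-64. */
#define HUGE_PAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)

/* Hypothetical holder for one executable mapping. */
typedef struct TextMapping
{
	uintptr_t	start;			/* first address of the mapping */
	uintptr_t	end;			/* one past the last address */
	uint64_t	offset;			/* offset into the backing file */
} TextMapping;

/* Parse one /proc/<pid>/maps line; true only for private r-x mappings. */
static bool
parse_rx_mapping(const char *line, TextMapping *m)
{
	unsigned long long start, end, offset;
	char		perms[5];

	if (sscanf(line, "%llx-%llx %4s %llx", &start, &end, perms, &offset) != 4)
		return false;
	if (strcmp(perms, "r-xp") != 0)
		return false;

	m->start = (uintptr_t) start;
	m->end = (uintptr_t) end;
	m->offset = offset;
	return true;
}

/* Aligned in memory (start address) and on disk (file offset)? */
static bool
mapping_is_huge_aligned(const TextMapping *m)
{
	return (m->start % HUGE_PAGE_SIZE) == 0 &&
		(m->offset % HUGE_PAGE_SIZE) == 0;
}

/* Count suitably aligned r-xp mappings of this process, or -1 on error. */
int
count_aligned_rx_mappings(void)
{
	FILE	   *f = fopen("/proc/self/maps", "r");
	char		line[4096];
	TextMapping m;
	int			n = 0;

	if (f == NULL)
		return -1;
	while (fgets(line, sizeof(line), f) != NULL)
	{
		if (parse_rx_mapping(line, &m) && mapping_is_huge_aligned(&m))
			n++;
	}
	fclose(f);
	return n;
}
```

A real version would also need to match the path against the postgres
binary (or a given library) and then act on the mapping, which this sketch
leaves out.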

Regarding glibc, we could try moving a couple of the hotter functions into
PG, using smaller and simpler coding, if that has better frontend cache
behavior. The paper "Understanding and Mitigating Front-End Stalls in
Warehouse-Scale Computers" talks about this, particularly section 4.4
regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the
mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying
to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this:

/usr/bin/ld: warning: -z separate-loadable-segments ignored

when using this in the top-level meson.build:

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])

According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of
the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't
appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
[ 0] GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

--
John Naylor
EDB: http://www.enterprisedb.com
