Re: remap the .text segment into huge pages at run time

From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: remap the .text segment into huge pages at run time
Date: 2022-11-04 21:21:26
Message-ID: 20221104212126.qfh3yzi7luvyy5d6@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > push the end of the .text segment over the next aligned boundary, or to
> > ~8MB in size.
>
> I don't understand why this is needed - as long as the pages are aligned to
> 2MB, why do we need to fill things up on disk? The in-memory contents are the
> relevant bit, no?

I now assume it's because you observed that the mappings set up by the
loader don't include the space between the segments?

With the right linker flags the segments are aligned well enough, both on
disk and in memory, to just map more:

bfd: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000000c7f58 0x00000000000c7f58  R      0x200000
  LOAD           0x0000000000200000 0x0000000000200000 0x0000000000200000
                 0x0000000000921d39 0x0000000000921d39  R E    0x200000
  LOAD           0x0000000000c00000 0x0000000000c00000 0x0000000000c00000
                 0x00000000002626b8 0x00000000002626b8  R      0x200000
  LOAD           0x0000000000fdf510 0x00000000011df510 0x00000000011df510
                 0x0000000000037fd6 0x000000000006a310  RW     0x200000

gold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,--rosegment
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x00000000009230f9 0x00000000009230f9  R E    0x200000
  LOAD           0x0000000000a00000 0x0000000000a00000 0x0000000000a00000
                 0x000000000033a738 0x000000000033a738  R      0x200000
  LOAD           0x0000000000ddf4e0 0x0000000000fdf4e0 0x0000000000fdf4e0
                 0x000000000003800a 0x000000000006a340  RW     0x200000

lld: -Wl,-zmax-page-size=0x200000,-zseparate-loadable-segments
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000033710c 0x000000000033710c  R      0x200000
  LOAD           0x0000000000400000 0x0000000000400000 0x0000000000400000
                 0x0000000000921cb0 0x0000000000921cb0  R E    0x200000
  LOAD           0x0000000000e00000 0x0000000000e00000 0x0000000000e00000
                 0x0000000000020ae0 0x0000000000020ae0  RW     0x200000
  LOAD           0x0000000001000000 0x0000000001000000 0x0000000001000000
                 0x00000000000174ea 0x0000000000049820  RW     0x200000

mold: -Wl,-zmax-page-size=0x200000,-zcommon-page-size=0x200000,-zseparate-loadable-segments
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000032dde9 0x000000000032dde9  R      0x200000
  LOAD           0x0000000000400000 0x0000000000400000 0x0000000000400000
                 0x0000000000921cbe 0x0000000000921cbe  R E    0x200000
  LOAD           0x0000000000e00000 0x0000000000e00000 0x0000000000e00000
                 0x00000000002174e8 0x0000000000249820  RW     0x200000

With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
are padded to the next 2MiB boundary. However, the OS / dynamic loader only
maps the necessary part, not all the zero padding.
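
To illustrate how one could check this on a running process, here is a rough
sketch that parses a single line in /proc/<pid>/maps format and computes how
much of the last 2MiB page the loader left unmapped. The helper names
(parse_exec_mapping, unmapped_tail) are made up for this illustration; this is
not code from the patch.

```c
#include <assert.h>
#include <stdio.h>

/* 2MiB, the huge page size used throughout this mail. */
#define HUGE_PAGE_SIZE 0x200000UL

/*
 * Hypothetical helper: parse the "start-end perms" prefix of one line in
 * /proc/<pid>/maps format. Returns 1 and fills *start / *end if the mapping
 * is executable, 0 otherwise (including on parse failure).
 */
static int
parse_exec_mapping(const char *line, unsigned long *start, unsigned long *end)
{
    char perms[5];

    if (sscanf(line, "%lx-%lx %4s", start, end, perms) != 3)
        return 0;
    /* perms looks like "r-xp"; the third character is the execute bit */
    return perms[2] == 'x';
}

/*
 * Hypothetical helper: number of bytes between the end of a mapping and the
 * next 2MiB boundary, i.e. the padding the loader did not map.
 */
static unsigned long
unmapped_tail(unsigned long end)
{
    unsigned long aligned = (end + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);

    return aligned - end;
}
```

Feeding each line read from /proc/self/maps through parse_exec_mapping() and
then calling unmapped_tail() on the end address would show the size of the
hole after the "R E" segment. The addresses in any sample line are of course
specific to one process.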

This means that before issuing a MADV_COLLAPSE, we can do an mremap() to
increase the length of the mapping so that it also covers the padding.
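
Roughly, the sequence could look like the sketch below. This is only an
illustration under a few assumptions, not the code I benchmarked: the function
names are made up, the caller is assumed to already know the start and end of
the current "R E" mapping, and MADV_COLLAPSE is defined by hand (value 25 in
the Linux uapi headers) in case the libc headers don't have it yet.

```c
#define _GNU_SOURCE             /* for mremap() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* Linux >= 6.1; older headers lack it */
#endif

#define HUGE_PAGE_SIZE ((uintptr_t) 0x200000)   /* 2MiB */

/* Round an address or length up to the next 2MiB boundary. */
static uintptr_t
round_up_2m(uintptr_t x)
{
    return (x + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
}

/*
 * Hypothetical helper: grow the mapping [start, end) in place so that it
 * ends on a 2MiB boundary, then ask the kernel to collapse it into huge
 * pages. Returns 0 on success, -1 on failure (errno set by the syscall,
 * unless start was not 2MiB aligned).
 */
static int
extend_and_collapse(void *start, void *end)
{
    size_t old_len = (uintptr_t) end - (uintptr_t) start;
    size_t new_len = round_up_2m((uintptr_t) end) - (uintptr_t) start;

    /* Only makes sense if the linker aligned the segment start. */
    if ((uintptr_t) start % HUGE_PAGE_SIZE != 0)
        return -1;

    /*
     * flags = 0: no MREMAP_MAYMOVE, so the mapping either grows in place
     * or the call fails. Code must not move.
     */
    if (new_len != old_len &&
        mremap(start, old_len, new_len, 0) == MAP_FAILED)
        return -1;

    return madvise(start, new_len, MADV_COLLAPSE);
}
```

Growing in place relies on the address range up to the next 2MiB boundary
being free, which is what the alignment flags above arrange, since the next
segment only starts at that boundary.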

MADV_COLLAPSE without mremap:

tps = 1117335.766756 (without initial connection time)

Performance counter stats for 'system wide':

1,169,012,466,070 cycles (55.53%)
729,146,640,019 instructions # 0.62 insn per cycle (66.65%)
7,062,923 itlb.itlb_flush (66.65%)
1,041,825,587 iTLB-loads (66.65%)
634,272,420 iTLB-load-misses # 60.88% of all iTLB cache accesses (66.66%)
27,018,254,873 itlb_misses.walk_active (66.68%)
610,639,252 itlb_misses.walk_completed_4k (44.47%)
24,262,549 itlb_misses.walk_completed_2m_4m (44.46%)
2,948 itlb_misses.walk_completed_1g (44.43%)

10.039217004 seconds time elapsed

MADV_COLLAPSE with mremap:

tps = 1140869.853616 (without initial connection time)

Performance counter stats for 'system wide':

1,173,272,878,934 cycles (55.53%)
746,008,850,147 instructions # 0.64 insn per cycle (66.65%)
7,538,962 itlb.itlb_flush (66.65%)
799,861,088 iTLB-loads (66.65%)
254,347,048 iTLB-load-misses # 31.80% of all iTLB cache accesses (66.66%)
14,427,296,885 itlb_misses.walk_active (66.69%)
221,811,835 itlb_misses.walk_completed_4k (44.47%)
32,881,405 itlb_misses.walk_completed_2m_4m (44.46%)
3,043 itlb_misses.walk_completed_1g (44.43%)

10.038517778 seconds time elapsed

compared to a run without any huge pages (via THP or MADV_COLLAPSE):

tps = 1034960.102843 (without initial connection time)

Performance counter stats for 'system wide':

1,183,743,785,066 cycles (55.54%)
678,525,810,443 instructions # 0.57 insn per cycle (66.65%)
7,163,304 itlb.itlb_flush (66.65%)
2,952,660,798 iTLB-loads (66.65%)
2,105,431,590 iTLB-load-misses # 71.31% of all iTLB cache accesses (66.66%)
80,593,535,910 itlb_misses.walk_active (66.68%)
2,105,377,810 itlb_misses.walk_completed_4k (44.46%)
1,254,156 itlb_misses.walk_completed_2m_4m (44.46%)
3,366 itlb_misses.walk_completed_1g (44.44%)

10.039821650 seconds time elapsed

So a 7.96% win from no-huge-pages to MADV_COLLAPSE and a further 2.11% win
from there to also using mremap(), yielding a total of 10.23%. It's similar
across runs.
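
For reference, the percentages are just ratios of the tps numbers above. A
small sketch of the arithmetic (the function name pct_gain is made up for
illustration):

```c
#include <assert.h>
#include <stdio.h>

/* Relative throughput gain of new_tps over base_tps, in percent. */
static double
pct_gain(double base_tps, double new_tps)
{
    return (new_tps / base_tps - 1.0) * 100.0;
}

/* Print the three comparisons, using the tps values from the runs above. */
static void
print_gains(void)
{
    const double tps_none = 1034960.102843;     /* no huge pages */
    const double tps_collapse = 1117335.766756; /* MADV_COLLAPSE only */
    const double tps_mremap = 1140869.853616;   /* mremap + MADV_COLLAPSE */

    printf("none -> collapse:     %.2f%%\n", pct_gain(tps_none, tps_collapse));
    printf("collapse -> + mremap: %.2f%%\n", pct_gain(tps_collapse, tps_mremap));
    printf("none -> + mremap:     %.2f%%\n", pct_gain(tps_none, tps_mremap));
}
```

Note that the two steps compose multiplicatively rather than additively,
which is why the total is slightly more than 7.96 + 2.11.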

On my system the other libraries unfortunately aren't aligned properly. It'd
be nice to also remap at least libc. The majority of the remaining misses are
from the vdso (too small for a huge page), libc (not aligned properly),
returning from system calls (which flush the iTLB) and pgbench / libpq (I
didn't add the mremap there, and without it there's not enough code for a
huge page).

Greetings,

Andres Freund
