Re: remap the .text segment into huge pages at run time

From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: remap the .text segment into huge pages at run time
Date: 2022-11-04 18:33:12
Lists: pgsql-hackers


This nerd-sniped me badly :)

On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> On 2022-11-02 13:32:37 +0700, John Naylor wrote:
> > I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
> > remap the .text segment to huge pages at program start. Attached is a
> > hackish, Meson-only, "works on my machine" patchset to experiment with this
> > idea.
> I wonder how far we can get with just using the linker hints to align
> sections. I know that the linux folks are working on promoting sufficiently
> aligned executable pages to huge pages too, and might have succeeded already.
> IOW, adding the linker flags might be a good first step.

Indeed, I did see that that works to some degree on the 5.19 kernel I was
running. However, it never seems to get around to using huge pages
sufficiently to compete with explicit use of huge pages.

More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
added into linux 6.1. That explicitly remaps a region and uses huge pages for
it. Of course that's going to take a while to be widely available, but it
seems like a safer approach than the remapping approach from this thread.

I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
the address / length), and it seems to work nicely.
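For illustration only (this is not the hack described above), a minimal sketch of what issuing such a hint could look like, written in Python via ctypes for brevity. Instead of hardcoding the address and length, it assumes the range is looked up in /proc/self/maps. The helper names, the 2MB huge page size and the use of /proc are assumptions; MADV_COLLAPSE = 25 is the value from the Linux uapi headers, and the call simply fails with an errno on kernels that don't know the hint.

```python
import ctypes
import os

MADV_COLLAPSE = 25            # Linux uapi value; only understood by 6.1+ kernels
HUGE_PAGE = 2 * 1024 * 1024   # assumption: 2MB huge pages (x86-64)


def huge_aligned_range(start, end, huge=HUGE_PAGE):
    """Shrink [start, end) inward to huge page boundaries; None if nothing is left."""
    lo = (start + huge - 1) // huge * huge
    hi = end // huge * huge
    return (lo, hi) if hi > lo else None


def text_mappings(maps_text, path):
    """Yield (start, end) for executable mappings of `path` in /proc/<pid>/maps content."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5] == path and fields[1].startswith("r-x"):
            start, end = (int(x, 16) for x in fields[0].split("-"))
            yield start, end


def collapse(start, length):
    """Call madvise(start, length, MADV_COLLAPSE); return 0 or the errno on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.madvise.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.madvise.restype = ctypes.c_int
    rc = libc.madvise(ctypes.c_void_p(start), ctypes.c_size_t(length), MADV_COLLAPSE)
    return 0 if rc == 0 else ctypes.get_errno()


def collapse_own_text():
    """Hypothetical driver: try to collapse the running executable's text mappings."""
    exe = os.path.realpath("/proc/self/exe")
    with open("/proc/self/maps") as f:
        maps_text = f.read()
    results = []
    for start, end in text_mappings(maps_text, exe):
        rng = huge_aligned_range(start, end)
        if rng is not None:
            results.append((rng, collapse(rng[0], rng[1] - rng[0])))
    return results
```

Calling collapse_own_text() from a Python process would target the interpreter's own text, so this only demonstrates the mechanics; in postgres the equivalent would be a madvise() call in C early at startup. Only the huge-page-aligned interior of a mapping can be backed by huge pages, which is why the range is shrunk inward first.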

With the weird caveat that on some filesystems one needs to make sure that the
executable doesn't use reflinks to share parts of other files, which is what
the mold linker and cp do... Not a concern on ext4, but it is on xfs. I took to
copying the postgres binary with cp --reflink=never.

FWIW, you can see the state of the page mapping in more detail with the
kernel's page-types tool:

sudo /home/andres/src/kernel/tools/vm/page-types -L -p 12297 -a 0x555555800,0x555556122
sudo /home/andres/src/kernel/tools/vm/page-types -f /srv/dev/build/m-opt/src/backend/postgres2

Perf results:

c=150;psql -f ~/tmp/prewarm.sql;perf stat -a -e cycles,iTLB-loads,iTLB-load-misses,itlb_misses.walk_active,itlb_misses.walk_completed_4k,itlb_misses.walk_completed_2m_4m,itlb_misses.walk_completed_1g pgbench -n -M prepared -S -P1 -c$c -j$c -T10


tps = 1038230.070771 (without initial connection time)

Performance counter stats for 'system wide':

1,184,344,476,152 cycles (71.41%)
2,846,146,710 iTLB-loads (71.43%)
2,021,885,782 iTLB-load-misses # 71.04% of all iTLB cache accesses (71.44%)
75,633,850,933 itlb_misses.walk_active (71.44%)
2,020,962,930 itlb_misses.walk_completed_4k (71.44%)
1,213,368 itlb_misses.walk_completed_2m_4m (57.12%)
2,293 itlb_misses.walk_completed_1g (57.11%)

10.064352587 seconds time elapsed


tps = 1113717.114278 (without initial connection time)

Performance counter stats for 'system wide':

1,173,049,140,611 cycles (71.42%)
1,059,224,678 iTLB-loads (71.44%)
653,603,712 iTLB-load-misses # 61.71% of all iTLB cache accesses (71.44%)
26,135,902,949 itlb_misses.walk_active (71.44%)
628,314,285 itlb_misses.walk_completed_4k (71.44%)
25,462,916 itlb_misses.walk_completed_2m_4m (57.13%)
2,228 itlb_misses.walk_completed_1g (57.13%)

Note that while the rate of iTLB misses changes comparatively little (71% vs
62%), the total number of iTLB loads reduced substantially, and the number of
cycles in which an iTLB miss was in progress is roughly 1/3 of what it was
before.
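The ratios behind that comparison can be recomputed from the two perf stat outputs above. A small sketch, with the counter values copied verbatim; the labels "first" and "second" just follow the order of the two runs, and the counters are multiplexed (see the percentages in parentheses), so these are estimates:

```python
# Counter values copied from the two perf stat runs above.
first = {
    "tps": 1038230.070771,
    "iTLB-loads": 2_846_146_710,
    "iTLB-load-misses": 2_021_885_782,
    "walk_active": 75_633_850_933,
}
second = {
    "tps": 1113717.114278,
    "iTLB-loads": 1_059_224_678,
    "iTLB-load-misses": 653_603_712,
    "walk_active": 26_135_902_949,
}


def miss_rate(run):
    """Fraction of iTLB loads that missed."""
    return run["iTLB-load-misses"] / run["iTLB-loads"]


def ratio(key):
    """Second run relative to the first run for one counter."""
    return second[key] / first[key]


print(f"miss rate:   {miss_rate(first):.1%} -> {miss_rate(second):.1%}")
print(f"iTLB loads:  {ratio('iTLB-loads'):.2f}x of the first run")
print(f"walk_active: {ratio('walk_active'):.2f}x of the first run")
print(f"tps:         {ratio('tps'):.3f}x of the first run")
```

The walk_active ratio comes out at about 0.35, the iTLB load count at about 0.37, and tps at about 1.07 times the first run.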

A lot of the remaining misses are from the context switches. The iTLB is
flushed on context switches, and of course pgbench -S is extremely context
switch heavy.

Comparing plain -S with 10 pipelined -S transactions (using -t 100000 / -t
10000 to compare the same amount of work) I get:


not pipelined:

tps = 1037732.722805 (without initial connection time)

Performance counter stats for 'system wide':

1,691,411,678,007 cycles (62.48%)
8,856,107 itlb.itlb_flush (62.48%)
4,600,041,062 iTLB-loads (62.48%)
2,598,218,236 iTLB-load-misses # 56.48% of all iTLB cache accesses (62.50%)
100,095,862,126 itlb_misses.walk_active (62.53%)
2,595,376,025 itlb_misses.walk_completed_4k (50.02%)
2,558,713 itlb_misses.walk_completed_2m_4m (50.00%)
2,146 itlb_misses.walk_completed_1g (49.98%)

14.582927646 seconds time elapsed


pipelined:
tps = 161947.008995 (without initial connection time)

Performance counter stats for 'system wide':

1,095,948,341,745 cycles (62.46%)
877,556 itlb.itlb_flush (62.46%)
4,576,237,561 iTLB-loads (62.48%)
307,971,166 iTLB-load-misses # 6.73% of all iTLB cache accesses (62.52%)
15,565,279,213 itlb_misses.walk_active (62.55%)
306,240,104 itlb_misses.walk_completed_4k (50.03%)
1,753,560 itlb_misses.walk_completed_2m_4m (50.00%)
2,189 itlb_misses.walk_completed_1g (49.96%)

9.374687885 seconds time elapsed


not pipelined:
tps = 1112040.859643 (without initial connection time)

Performance counter stats for 'system wide':

1,569,546,236,696 cycles (62.50%)
7,094,291 itlb.itlb_flush (62.51%)
1,599,845,097 iTLB-loads (62.51%)
692,042,864 iTLB-load-misses # 43.26% of all iTLB cache accesses (62.51%)
31,529,641,124 itlb_misses.walk_active (62.51%)
669,849,177 itlb_misses.walk_completed_4k (49.99%)
22,708,146 itlb_misses.walk_completed_2m_4m (49.99%)
2,752 itlb_misses.walk_completed_1g (49.99%)

13.611206182 seconds time elapsed


pipelined:
tps = 162484.443469 (without initial connection time)

Performance counter stats for 'system wide':

1,092,897,514,658 cycles (62.48%)
942,351 itlb.itlb_flush (62.48%)
233,996,092 iTLB-loads (62.48%)
102,155,575 iTLB-load-misses # 43.66% of all iTLB cache accesses (62.49%)
6,419,597,286 itlb_misses.walk_active (62.52%)
98,758,409 itlb_misses.walk_completed_4k (50.03%)
3,342,332 itlb_misses.walk_completed_2m_4m (50.02%)
2,190 itlb_misses.walk_completed_1g (49.98%)

9.355239897 seconds time elapsed

The difference in itlb.itlb_flush between pipelined / non-pipelined cases
unsurprisingly is stark.

While the pipelined case still sees a good bit less iTLB traffic, the total
number of cycles in which a walk is active is just not large enough to matter,
by the looks of it.
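To put numbers on that, here is a rough sketch that divides itlb_misses.walk_active by cycles for each of the four runs above and compares the itlb.itlb_flush counts. Values are copied verbatim. Treating the run that follows each "not pipelined" run as the pipelined one is an assumption based on the surrounding text, and since the counters are multiplexed the shares are estimates.

```python
# (cycles, itlb.itlb_flush, itlb_misses.walk_active) copied from the four runs above.
runs = {
    "first set, not pipelined":  (1_691_411_678_007, 8_856_107, 100_095_862_126),
    "first set, pipelined":      (1_095_948_341_745,   877_556,  15_565_279_213),
    "second set, not pipelined": (1_569_546_236_696, 7_094_291,  31_529_641_124),
    "second set, pipelined":     (1_092_897_514_658,   942_351,   6_419_597_286),
}


def walk_share(name):
    """Share of all counted cycles during which an iTLB page walk was active."""
    cycles, _, walk_active = runs[name]
    return walk_active / cycles


def flush_ratio(prefix):
    """How many times more iTLB flushes the non-pipelined run saw."""
    return runs[f"{prefix}, not pipelined"][1] / runs[f"{prefix}, pipelined"][1]


for name in runs:
    print(f"{name:27s} walk_active/cycles = {walk_share(name):.1%}")
for prefix in ("first set", "second set"):
    print(f"{prefix}: {flush_ratio(prefix):.1f}x more iTLB flushes without pipelining")
```

In the pipelined runs the share is on the order of 1% of cycles in both sets, against roughly 6% and 2% in the non-pipelined runs, which fits the unchanged pipelined tps.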


Andres Freund
