Re: [PATCH] Add support for choosing huge page size

From: Odin Ugedal <odin(at)ugedal(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Add support for choosing huge page size
Date: 2020-06-21 19:51:11
Message-ID: CAFpoUr0TR2bftHJhB24czz=wT_qrK1-fqvXTu5zySN+-4VS7GQ@mail.gmail.com
Lists: pgsql-hackers

> Documentation syntax error "<literal>2MB<literal>" shows up as:

Oops, sorry, that should be fixed now.

> The build is currently failing on Windows:

Ahh, thanks. It looks like the Windows build files aren't autogenerated, so
hopefully this new patch fixes it.

> When using huge_pages=on, huge_page_size=1GB, but default
> shared_buffers, I noticed that the error message reports the wrong
> (unrounded) size in this message:

Ahh, yes, that is correct. I have switched to printing the _real_ allocation size now!
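
For reference, the rounding at play is roughly the following (a hypothetical
helper, not the actual patch code, just to illustrate why the hint should
report the rounded size):

#include <stddef.h>

/*
 * Round a requested shared memory size up to a whole number of huge
 * pages.  With 1GB pages, the 149069824-byte request from the hint
 * quoted below becomes the 1073741824 bytes seen in the mmap() call.
 */
static size_t
round_up_to_huge_page(size_t size, size_t hugepagesize)
{
	if (size % hugepagesize != 0)
		size += hugepagesize - (size % hugepagesize);
	return size;
}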

> 1GB pages are so big that it becomes a little tricky to set shared
> buffers large enough without wasting RAM. What I mean is, if I want
> to use shared_buffers=16GB, I need to have at least 17 huge pages
> available, but the 17th page is nearly entirely wasted! Imagine that
> on POWER 16GB pages. That makes me wonder if we should actually
> redefine these GUCs differently so that you state the total, or at
> least use the rounded memory for buffers... I think we could consider
> that to be a separate problem with a separate patch though.

Yes, that is a good point! But as you say, I guess that fits better in
another patch.

> Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
> 3.5GB table against itself. [...]

Thanks for the results! I will look into your patch when I get time, but
it certainly looks cool! I have a 4-node NUMA machine with ~100GiB of
memory and a single-node NUMA machine, so I'll run some benchmarks on
both when I get the chance!

> I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got: [...]

Yes, making this "properly" NUMA-aware to avoid/limit cross-node memory
access is kinda tricky. When huge pages are reserved, they are distributed
more or less evenly between the nodes. The per-node counts can be inspected
with `grep -R ""
/sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages`
(those files can also be written to, to reserve pages on a specific node),
so there _may_ be a chance that some of the huge pages you got were on a
node other than 0 because node 0 did not have enough free, but that is
just guessing.
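
If you want to poke at huge page allocation outside of postgres, here is a
minimal standalone sketch of the Linux mmap() mechanism the patch relies on
(just a test program, not the patch's code, and it assumes a kernel/glibc
recent enough to expose MAP_HUGE_SHIFT in <sys/mman.h>). The
"30<<MAP_HUGE_SHIFT" in your strace output below is exactly this encoding,
since 2^30 bytes = 1GB:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int
main(void)
{
	size_t	size = (size_t) 1 << 30;	/* one 1GB huge page */
	int	shift = 30;			/* log2 of the huge page size */
	void   *p;

	/* Ask for an anonymous shared mapping backed by 1GB huge pages. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB |
		 (shift << MAP_HUGE_SHIFT),
		 -1, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");	/* typically ENOMEM when no 1GB pages are free */
		return 1;
	}
	munmap(p, size);
	return 0;
}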

On Thu, 18 Jun 2020 at 06:01, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>
> Hi Odin,
>
> Documentation syntax error "<literal>2MB<literal>" shows up as:
>
> config.sgml:1605: parser error : Opening and ending tag mismatch:
> literal line 1602 and para
> </para>
> ^
>
> Please install the documentation tools
> https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun
> configure and "make docs" to see these kinds of errors.
>
> The build is currently failing on Windows:
>
> undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h
> line 143 at src/tools/msvc/Mkvcbuild.pm line 851.
>
> I think that's telling us that you need to add this stuff into
> src/tools/msvc/Solution.pm, so that we can say it doesn't have it. I
> don't have Windows but whenever you post a new version we'll see if
> Windows likes it here:
>
> http://cfbot.cputube.org/odin-ugedal.html
>
> When using huge_pages=on, huge_page_size=1GB, but default
> shared_buffers, I noticed that the error message reports the wrong
> (unrounded) size in this message:
>
> 2020-06-18 02:06:30.407 UTC [73552] HINT: This error usually means
> that PostgreSQL's request for a shared memory segment exceeded
> available memory, swap space, or huge pages. To reduce the request
> size (currently 149069824 bytes), reduce PostgreSQL's shared memory
> usage, perhaps by reducing shared_buffers or max_connections.
>
> The request size was actually:
>
> mmap(NULL, 1073741824, PROT_READ|PROT_WRITE,
> MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30<<MAP_HUGE_SHIFT, -1, 0) = -1
> ENOMEM (Cannot allocate memory)
>
> 1GB pages are so big that it becomes a little tricky to set shared
> buffers large enough without wasting RAM. What I mean is, if I want
> to use shared_buffers=16GB, I need to have at least 17 huge pages
> available, but the 17th page is nearly entirely wasted! Imagine that
> on POWER 16GB pages. That makes me wonder if we should actually
> redefine these GUCs differently so that you state the total, or at
> least use the rounded memory for buffers... I think we could consider
> that to be a separate problem with a separate patch though.
>
> Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
> 3.5GB table against itself. Hash joins are the perfect way to
> exercise the TLB because they're very likely to miss. I also applied
> my patch[1] to allow parallel queries to use shared memory from the
> main shared memory area, so that they benefit from the configured page
> size, using pages that are allocated once at start up. (Without that,
> you'd have to mess around with /dev/shm mount options, and then hope
> that pages were available at query time, and it'd also be slower for
> other stupid implementation reasons).
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> # echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>
> shared_buffers=8GB
> dynamic_shared_memory_main_size=8GB
>
> create table t as select generate_series(1, 100000000)::int i;
> alter table t set (parallel_workers = 7);
> create extension pg_prewarm;
> select pg_prewarm('t');
> set max_parallel_workers_per_gather=7;
> set work_mem='1GB';
>
> select count(*) from t t1 join t t2 using (i);
>
> 4KB pages: 12.42 seconds
> 2MB pages: 9.12 seconds
> 1GB pages: 9.07 seconds
>
> Unfortunately I can't access the TLB miss counters on this system due
> to virtualisation restrictions, and the systems where I can don't have
> 1GB pages. According to cpuid(1) this system has a fairly typical
> setup:
>
> cache and TLB information (2):
> 0x63: data TLB: 2M/4M pages, 4-way, 32 entries
> data TLB: 1G pages, 4-way, 4 entries
> 0x03: data TLB: 4K pages, 4-way, 64 entries
>
> This operation is touching about 8GB of data (scanning 3.5GB of table,
> building a 4.5GB hash table) so 4 x 1GB is not enough to do this without
> TLB misses.
>
> Let's try that again, except this time with shared_buffers=4GB,
> dynamic_shared_memory_main_size=4GB, and only half as many tuples in
> t, so it ought to fit:
>
> 4KB pages: 6.37 seconds
> 2MB pages: 4.96 seconds
> 1GB pages: 5.07 seconds
>
> Well that's disappointing. I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got:
>
> 4KB pages: 5.43 seconds
> 2MB pages: 4.05 seconds
> 1GB pages: 4.00 seconds
>
> From this I can't really conclude that it's terribly useful to use
> larger page sizes, but it's certainly useful to have the ability to do
> further testing using the proposed GUC.
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com

Attachment: v4-0001-Add-support-for-choosing-huge-page-size.patch (text/x-patch, 16.4 KB)
