Re: [PATCH] Add support for choosing huge page size

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Odin Ugedal <odin(at)ugedal(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Add support for choosing huge page size
Date: 2020-06-21 20:55:17
Message-ID: 20200621205517.2wlosnes4h4l3dpv@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-06-18 16:00:49 +1200, Thomas Munro wrote:
> Unfortunately I can't access the TLB miss counters on this system due
> to virtualisation restrictions, and the systems where I can don't have
> 1GB pages. According to cpuid(1) this system has a fairly typical
> setup:
>
> cache and TLB information (2):
> 0x63: data TLB: 2M/4M pages, 4-way, 32 entries
> data TLB: 1G pages, 4-way, 4 entries
> 0x03: data TLB: 4K pages, 4-way, 64 entries

Hm. Doesn't that system have a second-level TLB (STLB) with more 1GB
entries? I think there are some errata around what Intel exposes via
cpuid here :(

Guessing that this is a skylake server chip?
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Memory_Hierarchy

> [...] Additionally there is a unified L2 TLB (STLB)
> [...] STLB
> [...] 1 GiB page translations:
> [...] 16 entries; 4-way set associative
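For back-of-envelope purposes, the translation reach of those two levels
works out as below (a sketch; the 4-entry L1 figure is from the cpuid
output, and the 16-entry STLB figure is the wikichip number, which may
not match this particular chip):

```python
# 1GiB-page TLB reach estimate for a Skylake-SP-like chip.
# Entry counts are taken from the cpuid output / wikichip table above;
# treat them as assumptions, not measurements of this machine.
GIB = 1 << 30

l1_dtlb_entries_1g = 4    # L1 dTLB, 1GiB pages (cpuid above)
stlb_entries_1g = 16      # unified L2 TLB (STLB), 1GiB pages (wikichip)

l1_reach = l1_dtlb_entries_1g * GIB
stlb_reach = stlb_entries_1g * GIB

print(l1_reach // GIB, "GiB covered by the L1 dTLB")   # 4 GiB
print(stlb_reach // GIB, "GiB covered by the STLB")    # 16 GiB
```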

> This operation is touching about 8GB of data (scanning 3.5GB of table,
> building a 4.5GB hash table) so 4 x 1GB is not enough do this without
> TLB misses.

I assume this uses 7 workers?
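Spelling out the coverage arithmetic from the quoted paragraph (rough
sizes from the email; the 16-entry STLB reach is again the wikichip
figure, not a measurement):

```python
# Working set vs. 1GiB-page TLB reach, using the sizes quoted above.
GIB = 1 << 30

table_scan  = 3.5 * GIB                 # table scanned
hash_table  = 4.5 * GIB                 # hash table built
working_set = table_scan + hash_table   # ~8 GiB touched

l1_dtlb_reach = 4 * GIB    # 4-entry 1GiB L1 dTLB (cpuid above)
stlb_reach    = 16 * GIB   # 16-entry 1GiB STLB, if present

print(working_set > l1_dtlb_reach)  # True: L1 dTLB misses expected
print(working_set < stlb_reach)     # True: an STLB could still cover it
```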

> Let's try that again, except this time with shared_buffers=4GB,
> dynamic_shared_memory_main_size=4GB, and only half as many tuples in
> t, so it ought to fit:
>
> 4KB pages: 6.37 seconds
> 2MB pages: 4.96 seconds
> 1GB pages: 5.07 seconds
>
> Well that's disappointing.

Hm, I don't actually know the answer to this: If this actually uses
multiple workers, won't the fact that each has an independent page table
(despite overlapping contents) leave fewer 1GB entries actually
available? Obviously depends on how the processes are scheduled (iirc
hyperthreading shares dTLBs).

Might be worth looking at whether there are cpu migrations or testing
with a single worker.

> I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got:

> 4KB pages: 5.43 seconds
> 2MB pages: 4.05 seconds
> 1GB pages: 4.00 seconds
>
> From this I can't really conclude that it's terribly useful to use
> larger page sizes, but it's certainly useful to have the ability to do
> further testing using the proposed GUC.

Due to the low number of 1GB entries, they're quite likely to be
problematic imo, especially when there are more concurrent misses than
there are TLB entries.

I'm somewhat doubtful that it's useful to use 1GB entries for all of our
shared memory when that's bigger than the maximum size the TLB can
cover. I suspect it'd be better to use 1GB entries for some of the
memory and smaller entries for the rest.
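As a toy illustration of such a split (all numbers hypothetical; the
function name and the policy of capping the 1GB region at the STLB's
1GiB reach are made up for illustration, not a proposed implementation):

```python
GIB = 1 << 30
MIB = 1 << 20

def split_huge_pages(shmem_bytes, stlb_1g_entries=16):
    """Hypothetical split: back shared memory with 1GiB pages only up
    to the STLB's 1GiB reach, and with 2MiB pages for the remainder."""
    one_gb_region = min(shmem_bytes // GIB, stlb_1g_entries) * GIB
    remainder = shmem_bytes - one_gb_region
    two_mb_pages = -(-remainder // (2 * MIB))  # round up
    return one_gb_region // GIB, two_mb_pages

# e.g. 24GiB of shared memory on a chip with a 16-entry 1GiB STLB:
print(split_huge_pages(24 * GIB))  # (16, 4096)
```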

Greetings,

Andres Freund
