Add GUC to tune glibc's malloc implementation.

From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
To: pgsql-hackers(at)postgresql(dot)org
Cc: tomas(dot)vondra(at)enterprisedb(dot)com
Subject: Add GUC to tune glibc's malloc implementation.
Date: 2023-06-22 13:35:12
Message-ID: 3424675.QJadu78ljV@aivenlaptop
Lists: pgsql-hackers

Hello,

Following some conversation with Tomas at PGCon, I decided to resurrect this
topic, which was previously discussed in the context of moving tuplesort to
use GenerationContext:
https://www.postgresql.org/message-id/8046109.NyiUUSuA9g%40aivenronan

The idea for this patch is that the behaviour of glibc's malloc can be
counterproductive for us in some cases. To summarise, glibc's malloc offers
(among others) two tunable parameters which greatly affect how it allocates
memory. From the mallopt manpage:

M_TRIM_THRESHOLD
When the amount of contiguous free memory at the top of
the heap grows sufficiently large, free(3) employs sbrk(2)
to release this memory back to the system. (This can be
useful in programs that continue to execute for a long
period after freeing a significant amount of memory.)

M_MMAP_THRESHOLD
For allocations greater than or equal to the limit
specified (in bytes) by M_MMAP_THRESHOLD that can't be
satisfied from the free list, the memory-allocation
functions employ mmap(2) instead of increasing the program
break using sbrk(2).

The thing is, by default, those parameters are adjusted dynamically by the
glibc itself. It starts with quite small thresholds, and raises them when the
program frees some memory, up to a certain limit. This patch proposes a new
GUC allowing the user to adjust those settings according to their workload.
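
As a rough illustration of what such tuning amounts to at the libc level, here
is a minimal sketch using glibc's mallopt(3). It is not the code from the
attached patch; the helper name pin_malloc_thresholds is hypothetical. Per the
mallopt manpage, explicitly setting either parameter also disables glibc's
dynamic adjustment of the mmap threshold.

```c
#include <malloc.h>   /* mallopt, M_TRIM_THRESHOLD, M_MMAP_THRESHOLD (glibc only) */

/*
 * Hypothetical helper (an assumption for illustration, not the patch's code):
 * pin both thresholds to fixed values, in bytes.
 * mallopt() returns 1 on success and 0 on error; we propagate that.
 */
static int
pin_malloc_thresholds(int trim_threshold, int mmap_threshold)
{
    /* Free memory at the top of the heap is kept until it exceeds this. */
    if (mallopt(M_TRIM_THRESHOLD, trim_threshold) != 1)
        return 0;

    /* Requests at or above this size are served by mmap() instead of brk. */
    if (mallopt(M_MMAP_THRESHOLD, mmap_threshold) != 1)
        return 0;

    return 1;
}
```

A caller would invoke it once, early in the process, for example
pin_malloc_thresholds(64 * 1024 * 1024, 256 * 1024). Note that glibc rejects
mmap thresholds above its built-in maximum, so the second call can fail for
very large values.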

This dynamic adjustment can cause problems. Let's take for example a table with 10k rows, and 32
columns (as defined by a bench script David Rowley shared last year when
discussing the GenerationContext for tuplesort), and execute the following
query, with 32MB of work_mem:

select * from t order by a offset 100000;

On unpatched master, attaching strace to the backend and grepping on brk|mmap,
we get the following syscalls:

brk(0x55b00df0c000) = 0x55b00df0c000
brk(0x55b00df05000) = 0x55b00df05000
brk(0x55b00df28000) = 0x55b00df28000
brk(0x55b00df52000) = 0x55b00df52000
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc49254000
brk(0x55b00df7e000) = 0x55b00df7e000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc48f7f000
mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc48e7e000
mmap(NULL, 200704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc4980f000
brk(0x55b00df72000) = 0x55b00df72000
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbc3d56d000

Using systemtap, we can hook into glibc's malloc static probes to log whenever
it adjusts its values. During the above query, glibc's malloc raised its
thresholds:

347704: New thresholds: mmap: 2101248 bytes, trim: 4202496 bytes

If we re-run the query, we get:

brk(0x55b00dfe2000) = 0x55b00dfe2000
brk(0x55b00e042000) = 0x55b00e042000
brk(0x55b00e0ce000) = 0x55b00e0ce000
brk(0x55b00e1e6000) = 0x55b00e1e6000
brk(0x55b00e216000) = 0x55b00e216000
brk(0x55b00e416000) = 0x55b00e416000
brk(0x55b00e476000) = 0x55b00e476000
brk(0x55b00dfbc000) = 0x55b00dfbc000

This time, our allocations are below the new mmap_threshold, so malloc gets us
our memory by repeatedly moving the brk pointer.

When running with the attached patch, and setting the new GUC:

set glibc_malloc_max_trim_threshold = '64MB';

We now get the following syscalls for the same query, for the first run:

brk(0x55b00df0c000) = 0x55b00df0c000
brk(0x55b00df2e000) = 0x55b00df2e000
brk(0x55b00df52000) = 0x55b00df52000
brk(0x55b00dfb2000) = 0x55b00dfb2000
brk(0x55b00e03e000) = 0x55b00e03e000
brk(0x55b00e156000) = 0x55b00e156000
brk(0x55b00e186000) = 0x55b00e186000
brk(0x55b00e386000) = 0x55b00e386000
brk(0x55b00e3e6000) = 0x55b00e3e6000

But for the second run, the memory allocated is kept in malloc's free lists
instead of being released to the kernel, generating no syscalls at all. This
brings us a significant performance improvement (up to twice the tps), at the
cost of more memory being used by the idle backend.
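
The retention effect can also be observed outside PostgreSQL. The following is
a standalone sketch, assuming glibc on Linux and a single-threaded process
whose allocations come from the brk heap. The helper name
heap_growth_after_free and the fixed 16MB mmap threshold are assumptions made
for the demo, not part of the patch.

```c
#define _DEFAULT_SOURCE   /* exposes sbrk() in <unistd.h> */
#include <malloc.h>       /* mallopt (glibc only) */
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: set the trim threshold, allocate and free one
 * block from the brk heap, and store in *growth how far the program break
 * sits above its starting point afterwards (negative if it moved down).
 * Returns 1 on success, 0 on failure.
 */
static int
heap_growth_after_free(int trim_threshold, size_t alloc_size, long *growth)
{
    /* Keep the request on the brk heap; assumes alloc_size is below 16MB. */
    if (mallopt(M_MMAP_THRESHOLD, 16 * 1024 * 1024) != 1 ||
        mallopt(M_TRIM_THRESHOLD, trim_threshold) != 1)
        return 0;

    char *before = sbrk(0);
    char *p = malloc(alloc_size);

    if (p == NULL)
        return 0;

    /* Volatile writes stop the compiler from eliding the malloc/free pair. */
    ((volatile char *) p)[0] = 1;
    ((volatile char *) p)[alloc_size - 1] = 1;

    free(p);

    char *after = sbrk(0);

    *growth = (long) (after - before);
    return 1;
}
```

With a 64MB trim threshold, a freed 4MB block should stay on the heap, so the
break remains raised; with a 128kB threshold, free() should hand the memory
back and the break should move down again.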

On the other hand, the default behaviour can also be a problem if a backend
makes big allocations for a short time and then never needs that amount of
memory again.

For example, running this query:

select * from generate_series(1, 1000000);

We allocate some memory. The first time it's run, malloc will use mmap to
satisfy it. Once it's freed, it will raise its threshold, and a second run
will allocate it on the heap instead. So if we run the query twice, we end up
with some memory in malloc's free lists that we may never use again. Using the
new GUC, we can actually control whether it will be given back to the OS by
setting a small value for the threshold.

I attached the results of the 10k rows / 32 columns / 32MB work_mem benchmark
with different values for glibc_malloc_max_trim_threshold.

I don't know how to write a test for this new feature so let me know if you
have suggestions. Documentation is not written yet, as I expect discussion on
this thread to lead to significant changes on the user-visible GUC or GUCs:
- should we provide one for trim which also adjusts mmap_threshold (current
patch) or several GUCs?
- should this be simplified to only offer the default behaviour (glibc takes
care of the threshold) and some presets ("greedy", to set trim_threshold to
work_mem, "frugal" to set it to a really small value)?

Best regards,

--
Ronan Dunklau

Attachment Content-Type Size
v1-0001-Add-options-to-tune-malloc.patch text/x-patch 10.6 KB
results_generation.ods application/vnd.oasis.opendocument.spreadsheet 53.4 KB
