Re: Add GUC to tune glibc's malloc implementation.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>, tomas(dot)vondra(at)enterprisedb(dot)com
Subject: Re: Add GUC to tune glibc's malloc implementation.
Date: 2023-06-28 22:31:01
Message-ID: 20230628223101.jprqvuxyzthdehdm@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-06-28 07:26:03 +0200, Ronan Dunklau wrote:
> I see it as a way to have *some* sort of control over the malloc
> implementation we use, instead of tuning our allocation patterns on top of it
> while treating it entirely as a black box. As for the tuning, I proposed
> earlier to replace this size-based parameter with a "profile"
> (greedy / conservative) to make it easier to pick a sensible value.

I don't think that makes it very usable - in some cases idle connections will
still use up a lot more memory than they do now, and in others they won't,
even when the retained memory doesn't help. And the behavior will be very
heavily dependent on the OS and glibc version.
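
For reference, a size-based knob like this presumably boils down to glibc's
mallopt(3) on the other side. A minimal sketch of that kind of tuning (not the
actual patch; the function name and the idea of passing a single size for both
knobs are just for illustration):

#include <malloc.h>
#include <stdio.h>

static void
tune_glibc_malloc(long keep_bytes)
{
	/*
	 * Don't shrink the heap with brk() unless more than keep_bytes are
	 * free at its top. Note that setting this disables glibc's dynamic
	 * adjustment of the threshold.
	 */
	if (mallopt(M_TRIM_THRESHOLD, (int) keep_bytes) == 0)
		fprintf(stderr, "mallopt(M_TRIM_THRESHOLD) failed\n");

	/* Over-allocate by keep_bytes whenever the heap has to grow. */
	if (mallopt(M_TOP_PAD, (int) keep_bytes) == 0)
		fprintf(stderr, "mallopt(M_TOP_PAD) failed\n");
}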

> On Tuesday, June 27, 2023 at 20:17:46 CEST, Andres Freund wrote:
> > > Except if you hinted we should write our own directly instead?
> > > > We e.g. could keep a larger number of memory blocks reserved
> > > > ourselves. Possibly by delaying the release of additionally held blocks
> > > > until we have been idle for a few seconds or such.
> > >
> > > I think keeping work_mem around after it has been used a couple times makes
> > > sense. This is the memory a user is willing to dedicate to operations,
> > > after all.
> >
> > The biggest overhead of returning pages to the kernel is that it triggers
> > zeroing the data during the next allocation. Particularly on multi-node
> > servers that's surprisingly slow. It's most commonly not the brk() or
> > mmap() calls themselves that are the performance issue.
> >
> > Indeed, with your benchmark, I see that most of the time, on my dual Xeon
> > Gold 5215 workstation, is spent zeroing newly allocated pages during page
> > faults. That microarchitecture is worse at this than some others, but it's
> > never free (or cache friendly).
>
> I'm not sure I see the practical difference between those, but that's
> interesting. Were you able to reproduce my results?

I see a somewhat smaller win than you observed, but it is still substantial.

The runtime difference between the "default" and "cached" malloc is almost
entirely in these bits:

cached:
- 8.93% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
- __memmove_evex_unaligned_erms
+ 6.77% minimal_tuple_from_heap_tuple
+ 2.04% _int_realloc
+ 0.04% AllocSetRealloc
0.02% 0x56281094806f
0.02% 0x56281094e0bf

vs

uncached:

- 14.52% postgres libc.so.6 [.] __memmove_evex_unaligned_erms
8.61% asm_exc_page_fault
- 5.91% __memmove_evex_unaligned_erms
+ 5.78% minimal_tuple_from_heap_tuple
0.04% 0x560130a2900f
0.02% 0x560130a20faf
+ 0.02% AllocSetRealloc
+ 0.02% _int_realloc

+ 3.81% postgres [kernel.vmlinux] [k] native_irq_return_iret
+ 1.88% postgres [kernel.vmlinux] [k] __handle_mm_fault
+ 1.76% postgres [kernel.vmlinux] [k] clear_page_erms
+ 1.67% postgres [kernel.vmlinux] [k] get_mem_cgroup_from_mm
+ 1.42% postgres [kernel.vmlinux] [k] cgroup_rstat_updated
+ 1.00% postgres [kernel.vmlinux] [k] get_page_from_freelist
+ 0.93% postgres [kernel.vmlinux] [k] mtree_range_walk

None of the latter are visible in a profile in the cached case.

I.e. the overhead is encountering page faults and individually allocating the
necessary memory in the kernel.
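
That cost is easy to demonstrate in isolation, entirely outside postgres. A
minimal standalone sketch (assumes Linux; the mapping size is arbitrary):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ ((size_t) 256 * 1024 * 1024)

static double
touch_all(char *p)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	memset(p, 1, SZ);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
	char	   *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
						 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* first touch: page faults, kernel zeroing each page */
	printf("fresh pages:    %.3fs\n", touch_all(p));

	/* pages stay populated: just the memset itself */
	printf("reused pages:   %.3fs\n", touch_all(p));

	/* return the pages to the kernel, as trimming does ... */
	madvise(p, SZ, MADV_DONTNEED);

	/* ... and the faults + zeroing are back */
	printf("after DONTNEED: %.3fs\n", touch_all(p));

	return 0;
}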

This isn't surprising; I just wanted to make sure I entirely understand.

Part of the reason this code is a bit worse is that it's using generation.c,
which doesn't cache any part of the context. Not that aset.c's level of
caching would help a lot, given that it caches the context itself, not later
blocks.
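
To illustrate what "keeping a larger number of memory blocks reserved
ourselves" could look like - a toy sketch only, not aset.c/generation.c code,
with made-up sizes and none of the real bookkeeping:

#include <stdlib.h>

#define BLOCK_SIZE			(8 * 1024 * 1024)	/* made up */
#define BLOCK_CACHE_SLOTS	8					/* made up */

static void *block_cache[BLOCK_CACHE_SLOTS];
static int	block_cache_len = 0;

static void *
block_alloc(void)
{
	/* reuse a still-faulted-in block if we have one */
	if (block_cache_len > 0)
		return block_cache[--block_cache_len];
	return malloc(BLOCK_SIZE);
}

static void
block_free(void *block)
{
	/* keep a bounded number of blocks reserved instead of freeing them */
	if (block_cache_len < BLOCK_CACHE_SLOTS)
		block_cache[block_cache_len++] = block;
	else
		free(block);
}

/* called e.g. after having been idle for a few seconds */
static void
block_cache_release(void)
{
	while (block_cache_len > 0)
		free(block_cache[--block_cache_len]);
}

The interesting part is the policy, i.e. how many blocks to keep and when to
call block_cache_release() - that's where something like the idle-timeout
approach mentioned above would come in.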

> > FWIW, in my experience trimming the brk()ed region doesn't work reliably
> > enough in real world postgres workloads to be worth relying on (from a
> > memory usage POV). Sooner or later you're going to have longer lived
> > allocations placed that will prevent it from happening.
>
> I'm not sure I follow: given our workload is clearly split at query and
> transaction boundaries, with memory released at those times, I've assumed
> (and noticed in practice, albeit not on a production system) that most memory
> at the top of the heap would be trimmable, as we don't keep much in between
> queries / transactions.

That's true for very simple workloads, but once you're beyond that you just
need some longer-lived allocation to happen. E.g. some relcache / catcache
miss during query execution, and there's no extant memory in
CacheMemoryContext, so a new block is allocated.
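
That effect is easy to reproduce standalone (glibc-specific, and this needs
glibc >= 2.33 for mallinfo2(); note also that newer malloc_trim() can
additionally madvise interior free pages, this only shows the brk() side):

#include <malloc.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct mallinfo2 mi;
	char	   *big;
	char	   *pin;

	/* force the big allocation onto the brk heap, not into an mmap */
	mallopt(M_MMAP_MAX, 0);

	/* a big transient allocation, e.g. per-query work memory */
	big = malloc(64 * 1024 * 1024);
	memset(big, 'x', 64 * 1024 * 1024);

	/* a small long-lived allocation (think: catcache entry) above it */
	pin = malloc(64);

	free(big);

	/*
	 * The 64MB we freed sits below the live 64-byte chunk, so it isn't
	 * part of the trimmable top of the heap.
	 */
	mi = mallinfo2();
	printf("free on heap:    %zu\n", mi.fordblks);	/* ~64MB */
	printf("trimmable (top): %zu\n", mi.keepcost);	/* tiny */

	free(pin);
	return 0;
}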

Greetings,

Andres Freund
