Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-01-29 14:59:04
Message-ID: 9a403293-f4ed-4df5-9446-5c8faef31efb@enterprisedb.com
Lists: pgsql-hackers

On 1/29/24 15:15, Ronan Dunklau wrote:
> On Monday, 29 January 2024 at 13:17:07 CET, Tomas Vondra wrote:
>>> Did you try running strace on the process? That may give you some
>>> insight into what malloc is doing. A more sophisticated approach would
>>> be using stap and plugging it into the malloc probes, for example
>>> memory_sbrk_more and memory_sbrk_less.
>>
>> No, I haven't tried that. In my experience strace is pretty expensive,
>> and if the issue is in glibc itself (before it does the syscalls),
>> strace won't really tell us much. Not sure, ofc.
>
> It would tell you how malloc actually performs your allocations, and how often
> they end up translated into syscalls. The main issue with glibc would be that
> it releases the memory too aggressively to the OS, IMO.
>
>>
>>> An important part of glibc's malloc behaviour in that regard comes from
>>> the adjustment of the mmap and trim thresholds. By default, malloc
>>> adjusts them dynamically, and you can poke into that using the
>>> memory_mallopt_free_dyn_thresholds probe.
>>
>> Thanks, I'll take a look at that.
>>
>>>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>>>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>>>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>>>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>>>> way to tweak glibc to address this), and it also means systems using
>>>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>>>> says the mempool has ~20% benefit on top of jemalloc.
>>>
>>> GLIBC's malloc offers some tuning for this. In particular, setting either
>>> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
>>> adjustment" behaviour and allow you to control what it's doing.
>>>
>>> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
>>> using sbrk isn't freed as easily, and you don't run into a pattern of
>>> moving the sbrk pointer up and down repeatedly. The automatic trade-off
>>> between the mmap and trim thresholds is supposed to prevent that, but the
>>> way it is incremented means you can end up in a bad place depending on
>>> your particular allocation pattern.
>>
>> So, what values would you recommend for these parameters?
>>
>> My concern is increasing those value would lead to (much) higher memory
>> usage, with little control over it. With the mempool we keep more
>> blocks, ofc, but we have control over freeing the memory.
>
> Right now, depending on your workload (especially if you use connection
> pooling), you can end up with a dynamically adjusted trim threshold of
> something like 32 or 64MB, and that memory will never be released back.
>

OK, so let's say I expect each backend to use ~90MB of memory (allocated
at once through memory contexts). How would you set the two limits? By
default M_MMAP_THRESHOLD is 128kB, which means blocks larger than 128kB
are mmap-ed and released back to the OS immediately on free.

But there are very few such allocations - the vast majority of blocks in
the benchmark workloads are <= 8kB or ~27kB (those from btbeginscan).
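
(As an aside, if we wanted to double-check how much of that actually ends
up mmap-ed vs. served from the sbrk heap, glibc's mallinfo2() reports
that - a minimal sketch, assuming glibc >= 2.33, with a made-up helper
name:)

    #include <malloc.h>
    #include <stdio.h>

    /* report how much malloc holds in the sbrk heap vs. mmap-ed blocks */
    static void
    report_malloc_usage(void)
    {
        struct mallinfo2 mi = mallinfo2();

        fprintf(stderr,
                "sbrk heap: %zu bytes (%zu in use, %zu free), "
                "mmap-ed: %zu bytes in %zu blocks\n",
                mi.arena, mi.uordblks, mi.fordblks, mi.hblkhd, mi.hblks);
    }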

So I'm thinking about leaving M_MMAP_THRESHOLD as is, but increasing the
M_TRIM_THRESHOLD value to a couple MBs. But I doubt that'll really help,
because what I expect to happen is we execute a query and it allocates
all memory up to a high watermark of ~90MB. And then the query
completes, and we release almost all of it. And even with trim threshold
set to e.g. 8MB we'll free almost all of it, no?
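
(For the record, the explicit tuning described above would look something
like this, called once early in backend startup - just a sketch; the 2MB
value is a placeholder, and per the mallopt man page setting either
threshold explicitly also disables the dynamic adjustment:)

    #include <malloc.h>

    static void
    tune_glibc_malloc(void)
    {
        /* keep mmap-ing blocks larger than 128kB (the current default) */
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);

        /*
         * Only give sbrk-ed memory back to the kernel once more than 2MB
         * (placeholder value) sits free at the top of the heap.
         */
        mallopt(M_TRIM_THRESHOLD, 2 * 1024 * 1024);
    }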

What we want to do is say - hey, we needed 90MB, and now we need 8MB. We
could free 82MB, but maybe let's wait a bit and see if we need that
memory again. And that's pretty much what the mempool does, but I don't
see how to do that using the mmap options.
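
(To illustrate the idea, here's a trivial sketch of "hold on to freed
blocks for a while" - not the actual mempool patch, just hypothetical
code with made-up names; the matching alloc side would search this list
before falling back to malloc():)

    #include <stdlib.h>

    #define POOL_MAX_CACHED 32      /* arbitrary cap on retained blocks */

    typedef struct CachedBlock
    {
        struct CachedBlock *next;
        size_t      size;
    } CachedBlock;

    static CachedBlock *cached_blocks = NULL;
    static int  ncached = 0;

    /* instead of freeing right away, keep the block for possible reuse */
    static void
    pool_free(void *ptr, size_t size)
    {
        /* assumes every block is at least sizeof(CachedBlock) bytes */
        CachedBlock *block = (CachedBlock *) ptr;

        if (ncached < POOL_MAX_CACHED)
        {
            block->size = size;
            block->next = cached_blocks;
            cached_blocks = block;
            ncached++;
            return;
        }

        /* cache full - only now return the memory to malloc */
        free(ptr);
    }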

> The first heuristic I had in mind was to set it to work_mem, up to a
> "reasonable" limit I guess. One can argue that it is expected for a backend to
> use work_mem frequently, and as such it shouldn't be released back. By setting
> work_mem to a lower value, we could ask glibc at the same time to trim the
> excess kept memory. That could be useful when a long-lived connection is
> pooled, and sees a spike in memory usage only once. Currently that could well
> end up with 32MB "wasted" permanently but tuning it ourselves could allow us
> to release it back.
>

I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.
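
(If we went the mallopt route, I'd imagine something along these lines -
again only a sketch, and backend_peak_allocated is a hypothetical counter
we'd have to maintain ourselves:)

    #include <malloc.h>

    /* hypothetical: peak memory this backend allocated in recent queries */
    extern size_t backend_peak_allocated;

    /* re-tune the trim threshold between queries, based on recent usage */
    static void
    adjust_trim_threshold(void)
    {
        size_t      threshold = backend_peak_allocated;

        /* clamp to something sane */
        if (threshold < 4 * 1024 * 1024)
            threshold = 4 * 1024 * 1024;
        if (threshold > 256 * 1024 * 1024)
            threshold = 256 * 1024 * 1024;

        mallopt(M_TRIM_THRESHOLD, (int) threshold);
    }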

But then there's the problem that the mmap parameters don't tell us how
much memory to keep, only how large a chunk of free memory to release.

Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it only takes one query peaking a little above that, and the
free space at the top of the heap exceeds the threshold and all of it
gets released anyway, or something.

> Since it was last year that I worked on this, I'm a bit fuzzy on the
> details, but I hope this helps.
>

Thanks for the feedback / insights!

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
