Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-01-31 18:25:54
Message-ID: 0da51f67-c88b-497e-bb89-d5139309eb9c@enterprisedb.com
Lists: pgsql-hackers

On 1/29/24 16:42, Ronan Dunklau wrote:
> On Monday, 29 January 2024 at 15:59:04 CET, Tomas Vondra wrote:
>> I'm not sure work_mem is a good parameter to drive this. It doesn't say
>> how much memory we expect the backend to use - it's a per-operation
>> limit, so it doesn't work particularly well with partitioning (e.g. with
>> 100 partitions, we may get 100 nodes, which is completely unrelated to
>> what work_mem says). A backend running the join query with 1000
>> partitions uses ~90MB (judging by data reported by the mempool), even
>> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
>
> I understand your point; I was basing my previous observations on what a
> backend typically does during execution.
>
>>
>> The mempool could tell us how much memory we need (but we could track
>> this in some other way too, probably). And we could even adjust the mmap
>> parameters regularly, based on current workload.
>>
>> But then there's the problem that the mmap parameters don't tell us how
>> much memory to keep, but how large chunks to release.
>>
>> Let's say we want to keep the 90MB (to allocate the memory once and then
>> reuse it). How would you do that? We could set MMAP_TRIM_THRESHOLD to
>> 100MB, but then it takes just a little bit of extra free memory above
>> that to release all of it, or something.
>
> For doing this you can set M_TOP_PAD using glibc malloc, which makes sure
> a certain amount of memory is always kept.
>
> But the way the dynamic adjustment works makes it sort-of work like this.
> MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
> expect to keep much memory around.
>
> So even "small" memory allocations will be served using mmap at first. Once
> mmapped memory is released, glibc considers it a benchmark for "normal"
> allocations that can be routinely freed, and adjusts mmap_threshold to the
> released mmapped region size, and the trim threshold to twice that.
>
> It means over time the two values will converge either to the max values
> (32MB for MMAP_THRESHOLD, 64MB for the trim threshold) or to something big
> enough to accommodate your released memory, since anything bigger than half
> the trim threshold will be allocated using mmap.
>
> Setting any of these parameters explicitly disables that dynamic adjustment.
>

Thanks. I gave this a try, and I started the tests with this setting:

export MALLOC_TOP_PAD_=$((64*1024*1024))
export MALLOC_MMAP_THRESHOLD_=$((1024*1024))
export MALLOC_TRIM_THRESHOLD_=$((1024*1024))

which I believe means that:

1) we'll keep 64MB of "extra" memory on top of the heap, serving as a cache
for future allocations

2) everything below 1MB (so most of the blocks we allocate for contexts)
will be allocated on the heap (hence from the cache)

3) we won't trim the heap unless there's at least 1MB of contiguous free
space at its top (I wonder if this should be the same as MALLOC_TOP_PAD)
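
FWIW the same thing could be done from within the backend using mallopt()
instead of environment variables. A rough sketch (glibc-specific, and not
something any of the patches here actually does):

#include <malloc.h>

/* glibc-specific: mirror the MALLOC_*_ environment variables above */
static void
tune_glibc_malloc(void)
{
    mallopt(M_TOP_PAD, 64 * 1024 * 1024);      /* keep 64MB of "extra" heap */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);    /* serve blocks < 1MB from the heap */
    mallopt(M_TRIM_THRESHOLD, 1024 * 1024);    /* trim only with > 1MB free at the top */
}

Calling mallopt() like this would also disable the dynamic threshold
adjustment, same as setting the environment variables.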

Those are mostly arbitrary values / guesses, and I don't have complete
results yet. But from the results I have it seems this has almost the
same effect as the mempool thing - see the attached PDF, with results
for the "partitioned join" benchmark.

first column - "master" (17dev) with no patches, default glibc

second column - 17dev + locking + mempool, default glibc

third column - 17dev + locking, tuned glibc

The color scale on the right is a throughput comparison (third/second), as
a percentage, with e.g. 90% meaning tuned glibc is 10% slower than the
mempool results. Most of the time it's slower but very close to 100%, and
sometimes it's a bit faster. So overall it's roughly the same.

The color scales below the results are a comparison of each branch to
master (without patches), i.e. to current performance. It's almost the
same, although the tuned glibc has a couple of regressions that the
mempool does not have.

> But I'm not arguing against the mempool, just chiming in with glibc's malloc
> tuning possibilities :-)
>

Yeah. I think the main problem with the glibc parameters is that they're
very implementation-specific and also static - the mempool is more
adaptive, I think. But it's an interesting experiment.
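
Just to illustrate how implementation-specific it is, the dynamic
adjustment Ronan describes boils down to roughly this (a paraphrased
sketch of what glibc does when free() releases an mmap-allocated chunk,
not the actual glibc code):

#include <stddef.h>

#define MMAP_THRESHOLD_MAX  (32 * 1024 * 1024)   /* on 64-bit systems */

static size_t mmap_threshold = 128 * 1024;   /* glibc's default starting values */
static size_t trim_threshold = 128 * 1024;
static int    no_dyn_threshold = 0;          /* set when mallopt()/env vars are used */

static void
adjust_thresholds_on_free(size_t chunk_size)
{
    if (!no_dyn_threshold &&
        chunk_size > mmap_threshold &&
        chunk_size <= MMAP_THRESHOLD_MAX)
    {
        mmap_threshold = chunk_size;           /* similar requests now come from the heap */
        trim_threshold = 2 * mmap_threshold;   /* ... and stay cached there when freed */
    }
}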

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: glibc-malloc-tuning.pdf (application/pdf, 76.7 KB)
