Re: Use generation context to speed up tuplesorts

From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
To: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Subject: Re: Use generation context to speed up tuplesorts
Date: 2021-12-08 15:51:17
Message-ID: 8046109.NyiUUSuA9g@aivenronan
Lists: pgsql-hackers

On Thursday, September 9, 2021 15:37:59 CET, Tomas Vondra wrote:
> And now comes the funny part - if I run it in the same backend as the
> "full" benchmark, I get roughly the same results:
>
> block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
> 32768 | 512 | 806256640 | 37159 | 76669
>
> but if I reconnect and run it in the new backend, I get this:
>
> block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> ------------+------------+---------------+----------+---------
> 32768 | 512 | 806158336 | 233909 | 100785
> (1 row)
>
> It does not matter if I wait a bit before running the query, if I run it
> repeatedly, etc. The machine is not doing anything else, the CPU is set
> to use "performance" governor, etc.

I've reproduced the behaviour you mention.
I also noticed asm_exc_page_fault showing up in the perf report in that case.

Running strace on it shows a lot of brk calls in that case, while when we run
it in the same process as the previous tests, we don't see them.

My suspicion is that the previous workload makes glibc's malloc adjust its
trim_threshold (and possibly other dynamic options), which leads to the brk
pointer constantly moving in one case but not the other.

Running your fifo test with absurdly large malloc options shows that this
might indeed be the case (I needed to change several options, because changing
just one disables the dynamic adjustment for all of them, and malloc would
then fall back to using mmap and freeing the memory on each iteration):

mallopt(M_TOP_PAD, 1024 * 1024 * 1024);		/* keep 1 GB of pad at the top of the heap */
mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);	/* only trim once 256 MB is free at the top */
mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));	/* serve almost everything from the heap */

I get the following results for your self-contained test. I ran the query
twice in each case, to see the difference between the first allocation and
the subsequent ones:

With default malloc options:

block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 300156 | 207557

block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 211942 | 77207

With the oversized values above:

block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 219000 | 36223

block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
32768 | 512 | 795836416 | 75761 | 78082
(1 row)

I can't tell how representative your benchmark extension is of real-life
allocation/free patterns, but there is probably something we can improve
here.

I'll try to see if I can understand more precisely what is happening.

--
Ronan Dunklau
