Re: Use generation context to speed up tuplesorts

From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
To: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Use generation context to speed up tuplesorts
Date: 2021-12-16 10:56:15
Message-ID: 2740198.88bMQJbFj6@aivenronan
Lists: pgsql-hackers

On Wednesday, December 8, 2021 at 22:07:12 CET, Tomas Vondra wrote:
> On 12/8/21 16:51, Ronan Dunklau wrote:
> > On Thursday, September 9, 2021 at 15:37:59 CET, Tomas Vondra wrote:
> >> And now comes the funny part - if I run it in the same backend as the
> >> "full" benchmark, I get roughly the same results:
> >>
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806256640 |    37159 |   76669
> >>
> >> but if I reconnect and run it in the new backend, I get this:
> >>
> >>  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> >> ------------+------------+---------------+----------+---------
> >>       32768 |        512 |     806158336 |   233909 |  100785
> >> (1 row)
> >>
> >> It does not matter if I wait a bit before running the query, if I run it
> >> repeatedly, etc. The machine is not doing anything else, the CPU is set
> >> to use "performance" governor, etc.
> >
> > I've reproduced the behaviour you mention.
> > I also noticed asm_exc_page_fault showing up in the perf report in that
> > case.
> >
> > Running an strace on it shows that in one case we have a lot of brk
> > calls, while when we run in the same process as the previous tests, we
> > don't.
> >
> > My suspicion is that the previous workload makes glibc malloc change its
> > trim_threshold and possibly other dynamic options, which leads to
> > constantly moving the brk pointer in one case and not the other.
> >
> > Running your fifo test with absurd malloc options shows that this might
> > indeed be the case (I needed to change several of them, because changing
> > one disables the dynamic adjustment for all of them, and malloc would
> > fall back to using mmap and freeing the memory on each iteration):
> >
> > mallopt(M_TOP_PAD, 1024 * 1024 * 1024);
> > mallopt(M_TRIM_THRESHOLD, 256 * 1024 * 1024);
> > mallopt(M_MMAP_THRESHOLD, 4*1024*1024*sizeof(long));
> >
> > I get the following results for your self-contained test. I ran the query
> > twice in each case, to see how much the first allocations matter compared
> > to the subsequent ones:
> >
> > With default malloc options:
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   300156 |  207557
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   211942 |   77207
> >
> > With the oversized values above:
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |   219000 |   36223
> >
> >  block_size | chunk_size | mem_allocated | alloc_ms | free_ms
> > ------------+------------+---------------+----------+---------
> >       32768 |        512 |     795836416 |    75761 |   78082
> > (1 row)
> >
> > I can't tell how representative your benchmark extension would be of real
> > life allocation / free patterns, but there is probably something we can
> > improve here.
>
> Thanks for looking at this. I think those allocation / free patterns are
> fairly extreme, and there probably are no workloads doing exactly this.
> The idea is that actual workloads are likely some combination of these
> extreme cases.
>
> > I'll try to see if I can understand more precisely what is happening.
>
> Thanks, that'd be helpful. Maybe we can learn something about tuning
> malloc parameters to get significantly better performance.
>

Apologies for the long email; maybe what I state here is obvious to others,
but I learnt a lot, so...

I think I understood what the problem was in your generation tests: depending
on the sequence of allocations, we end up raising mmap_threshold and
trim_threshold to different maximums. When an mmap'ed block is freed, malloc
raises its mmap_threshold, as it considers memory of that size to be better
served by regular movements of the sbrk pointer than by mapping and unmapping
it with mmap. Each time we free an mmap'ed block bigger than the current
threshold, the threshold is raised to the size of that block, which in the
traces below amounts to roughly doubling it every time.

At the same time, the trim_threshold is raised to twice the mmap_threshold,
on the grounds that this amount of memory should not be released back to the
OS because we have a good chance of reusing it.
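
To make that mechanism concrete, here is a small self-contained C program
(just an illustration, not part of any patch, and assuming glibc's default
128kB mmap threshold). Run it under "strace -e trace=brk,mmap,munmap": the
first 256kB allocation is served by mmap and released with munmap, and that
first free raises the thresholds, so the identical subsequent allocations are
served from the heap and the heap is not trimmed when they are freed:

#include <stdlib.h>
#include <string.h>

int
main(void)
{
    /*
     * 256kB is above glibc's default mmap threshold, so the first
     * allocation is served by mmap.  Freeing it raises mmap_threshold to
     * the freed size and trim_threshold to twice that, so the identical
     * later allocations are served from the heap via brk and the heap is
     * not trimmed when they are freed.
     */
    for (int i = 0; i < 4; i++)
    {
        char *p = malloc(256 * 1024);

        if (p == NULL)
            return 1;
        memset(p, 0, 256 * 1024);   /* touch the memory */
        free(p);
    }
    return 0;
}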

This can be demonstrated using the attached systemtap script, along with a
patch adding new traces to generation_context for this purpose. When running
your query:

select block_size, chunk_size, x.* from
(values (512)) AS a(chunk_size),
(values (32768)) AS b(block_size),
lateral generation_bench_fifo(1000000, block_size, chunk_size,
2*chunk_size, 100, 10000, 5000) x;

We obtain the following trace of threshold adjustments:

2167837: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2167837: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2167837: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2167837: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2167837: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes

When running the full benchmark, we reach a higher threshold at some point:

2181301: New thresholds: mmap: 135168 bytes, trim: 270336 bytes
2181301: New thresholds: mmap: 266240 bytes, trim: 532480 bytes
2181301: New thresholds: mmap: 528384 bytes, trim: 1056768 bytes
2181301: New thresholds: mmap: 1052672 bytes, trim: 2105344 bytes
2181301: New thresholds: mmap: 16003072 bytes, trim: 32006144 bytes
2181301: New thresholds: mmap: 24002560 bytes, trim: 48005120 bytes

This is because at some point in the full benchmark we allocate a block
bigger than the mmap threshold, which malloc serves with an mmap, and then
free it, which means we now also raise the trim_threshold. The subsequent
allocations in the lone query end up between those thresholds, so they are
served by moving the sbrk pointer and then releasing the memory back to the
OS, which turns out to be expensive too.

One thing that I observed is that the effect of constantly moving the sbrk
pointer still happens with a real tuplesort workload, but not as much. I
haven't tested the reorder buffer yet, which is the other part of the code
using a generation context.

So, I decided to benchmark while trying to control the malloc thresholds.
These results are from my local laptop, with an i7 processor, for the
benchmark initially proposed by David, with 4GB work_mem and otherwise
default settings.

I benchmarked master, the original patch with a fixed block size, and the one
adjusting the block size.

For each of them, I ran the benchmark using the default malloc options (i.e.
adjusting the thresholds dynamically), and with the mmap threshold set to
32MB (the maximum on my platform, 4 * 1024 * 1024 * sizeof(long)) and the
trim threshold set to the value malloc would have chosen by itself on
reaching that mmap threshold (i.e. 64MB). I did that by setting the
MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_ environment variables.

The results are in the attached spreadsheet.

I will follow up with a benchmark of the test sorting a table whose width
varies from 1 to 32 columns.

As of now, my conclusion is that for glibc malloc, the block size we use
doesn't really matter, as long as we tell malloc not to release a certain
amount of memory back to the system once it has been allocated. Setting
mmap_threshold to min(work_mem, DEFAULT_MMAP_THRESHOLD_MAX) and
trim_threshold to twice that would IMO take us to where a long-lived backend
would likely end up anyway: as soon as we allocate min(work_mem, 32MB), we
won't give it back to the system, saving a huge number of syscalls in common
cases. Doing that would not change the allocation profiles on other
platforms, and should be safe for them.
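
To make the proposal concrete, here is a rough sketch of what that could look
like (only a sketch under the assumptions above, not a patch: the function
name MallocTuneThresholds, the work_mem_bytes parameter and the
MMAP_THRESHOLD_MAX define are hypothetical, the latter standing in for
glibc's internal DEFAULT_MMAP_THRESHOLD_MAX of 4 * 1024 * 1024 * sizeof(long)):

#ifdef __GLIBC__
#include <malloc.h>
#endif
#include <stddef.h>

/* assumed to match glibc's internal DEFAULT_MMAP_THRESHOLD_MAX */
#define MMAP_THRESHOLD_MAX (4 * 1024 * 1024 * sizeof(long))

static void
MallocTuneThresholds(size_t work_mem_bytes)
{
#ifdef __GLIBC__
    size_t mmap_threshold = work_mem_bytes < MMAP_THRESHOLD_MAX ?
        work_mem_bytes : MMAP_THRESHOLD_MAX;

    /*
     * Explicitly setting any M_* option disables glibc's dynamic tuning,
     * so set both thresholds together, mirroring the values the dynamic
     * logic would have converged to on its own.
     */
    mallopt(M_MMAP_THRESHOLD, (int) mmap_threshold);
    mallopt(M_TRIM_THRESHOLD, (int) (2 * mmap_threshold));
#else
    (void) work_mem_bytes;      /* nothing to tune on other platforms */
#endif
}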

The only problem I see with doing that is for allocations in excess of
work_mem: they would no longer trigger a threshold raise, since once we're in
"non-dynamic" mode, glibc's malloc keeps our manually set values. I guess
this proposal could be refined to adjust the thresholds dynamically ourselves
when pfreeing blocks, just like malloc does; see the sketch below.
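
A minimal sketch of that refinement (again hypothetical: the helper name and
the idea of calling it wherever we pfree oversized blocks just mirror the
description above, and MMAP_THRESHOLD_MAX is assumed as before):

#ifdef __GLIBC__
#include <malloc.h>
#include <stddef.h>

#define MMAP_THRESHOLD_MAX (4 * 1024 * 1024 * sizeof(long))

/* last value we passed to mallopt(M_MMAP_THRESHOLD, ...) */
static size_t current_mmap_threshold = 0;

static void
MallocMaybeRaiseThresholds(size_t freed_block_size)
{
    /*
     * Mimic glibc's own dynamic adjustment: if the block we just freed is
     * bigger than the threshold we pinned, raise both thresholds so blocks
     * of that size stay on the heap from now on.
     */
    if (freed_block_size > current_mmap_threshold &&
        freed_block_size <= MMAP_THRESHOLD_MAX)
    {
        current_mmap_threshold = freed_block_size;
        mallopt(M_MMAP_THRESHOLD, (int) current_mmap_threshold);
        mallopt(M_TRIM_THRESHOLD, (int) (2 * current_mmap_threshold));
    }
}
#endif                          /* __GLIBC__ */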

What do you think of this analysis and idea?

--
Ronan Dunklau

Attachment Content-Type Size
add_tracing_to_generation_context.patch text/x-patch 2.2 KB
malloc_generation.stp text/plain 1.9 KB
bench_results_5tests.ods application/vnd.oasis.opendocument.spreadsheet 14.6 KB
