Re: Use generation context to speed up tuplesorts

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Use generation context to speed up tuplesorts
Date: 2021-09-09 13:37:59
Message-ID: bcdd4e3e-c12d-cd2b-7ead-a91ad416100a@enterprisedb.com
Lists: pgsql-hackers

Hi,

I've spent quite a bit of time investigating the regressions clearly
visible in the benchmark results for some allocation/free patterns.
Attached is a subset of results from a recent run, but the old results
show mostly the same thing.

Some allocation patterns (e.g. fifo-cycle or lifo-cycle) are perfectly
fine - the patches are a clear improvement. But the results for the
fifo-decrease pattern, for example, look like this (showing just the
palloc timings):

  block   chunk     0001     0002     0003     0004
 ----------------------------------------------------
  32768     512    38497    37234    39817   226451

0001 just adds an extension used for the benchmarking, so it's pretty
much "master" without any behavior changes. 0002 and 0003 make changes
that seem to have no impact on this workload, but 0004 makes it almost
6x slower (226451 vs. 39817). That's bizarre, because 0004 merely tweaks
how we calculate the threshold for oversized chunks.

But that shouldn't have any impact, because with the benchmark
parameters the threshold should be 64kB, way more than the chunk size
(which is what we allocate).

Similarly, it should not matter whether we double or quadruple the block
size, because we reach the maximum block size (1MB) very quickly in both
cases, and then just use 1MB blocks for 99% of the benchmark.
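To be explicit about the arithmetic, here's a trivial standalone sketch,
assuming we start from the 32kB block size used in the timings above:

#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    size_t      blkSize = 32 * 1024;        /* initial block size (benchmark) */
    size_t      maxBlockSize = 1024 * 1024; /* 1MB cap */
    int         steps = 0;

    /* double until we hit the cap; quadrupling just halves the steps */
    while (blkSize < maxBlockSize)
    {
        blkSize *= 2;
        steps++;
    }

    printf("%d doublings to reach %zu bytes\n", steps, blkSize);    /* 5 */

    return 0;
}

So after a handful of blocks we're allocating nothing but 1MB blocks,
regardless of the growth factor.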

I was thinking that maybe the extra allocChunkLimit increases the size
of the GenerationContext struct, requiring another cacheline. But no,
it's 3 cachelines in both cases.

But while trying to investigate / profile this, I noticed rather bizarre
behavior that I can't explain, but which I suspect may be related to the
slowdown.

The benchmark queries execute generation_bench_fifo() for a range of
block_size / chunk_size combinations, and look roughly like this:

select block_size, chunk_size, x.*
  from generate_series(32,512,32) a(chunk_size),
       (values (1024), (2048), (4096), (8192), (16384), (32768))
         AS b(block_size),
       lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                     2*chunk_size, 100, 10000, 5000) x;

Which takes a while to run, and then produces something like this:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
       1024 |         32 |      81528832 |    62016 |   24601
       1024 |         64 |     137449472 |    61951 |   36034
       1024 |         96 |     208408464 |    85437 |   57984
 ...
      32768 |        448 |     707756032 |    36605 |   67648
      32768 |        480 |     757104640 |    36838 |   71967
      32768 |        512 |     806256640 |    37073 |   76489
(96 rows)

Now, I can run the benchmark for a single case (e.g. the last one in the
results above), say like this:

select block_size, chunk_size, x.*
  from (values (512)) AS a(chunk_size),
       (values (32768)) AS b(block_size),
       lateral generation_bench_fifo(1000000, block_size, chunk_size,
                                     2*chunk_size, 100, 10000, 5000) x;

And now comes the funny part - if I run it in the same backend as the
"full" benchmark, I get roughly the same results:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     806256640 |    37159 |   76669

but if I reconnect and run it in a new backend, I get this:

 block_size | chunk_size | mem_allocated | alloc_ms | free_ms
------------+------------+---------------+----------+---------
      32768 |        512 |     806158336 |   233909 |  100785
(1 row)

It does not matter if I wait a bit before running the query, run it
repeatedly, etc. The machine is not doing anything else, the CPU
frequency governor is set to "performance", and so on.

This is really strange, and the suspicious thing is that this slower
timing is very close to the 0004 timing from the first table (226451).
So perhaps the root cause of both slowdowns is the same - it's just that
with 0004 it happens all the time.

According to perf profiling, there's one pretty significant change -
0004 spends quite a bit of time in asm_exc_page_fault:

    41.17%     5.26%  postgres  postgres  [.] GenerationAlloc
            |
            --36.06%--GenerationAlloc
                      |
                      |--28.95%--asm_exc_page_fault
                      |          |
                      |           --23.17%--exc_page_fault
                      |                     |

while on 0003 there's not a single frame with asm_exc_page_fault. I have
no idea why, though.
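
For context, one common source of asm_exc_page_fault in an allocator is
first-touch of freshly mapped anonymous memory - when glibc hands back a
brand new region for a block, the first write to each page traps into the
kernel. That alone doesn't explain why 0003 avoids it while 0004 doesn't,
but here's a minimal standalone illustration of the effect (my own sketch,
nothing to do with the patches):

#include <string.h>
#include <sys/mman.h>

int
main(void)
{
    size_t      len = 1024 * 1024;      /* one 1MB "block" */
    char       *p;

    /* fresh anonymous memory - no physical pages backing it yet */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* the first write to each page faults into the kernel */
    memset(p, 1, len);

    /* a second pass over the same (now backed) memory does not fault */
    memset(p, 2, len);

    munmap(p, len);
    return 0;
}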

I wonder if this might be some unexpected / unfortunate interaction with
kernel memory management, or something like that. Any ideas what might
be the root cause, what to look for, etc.?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
generation.ods application/vnd.oasis.opendocument.spreadsheet 337.5 KB
0001-generation-bench.patch text/x-patch 32.2 KB
0002-Generation-grow-blocks.patch text/x-patch 7.5 KB
0003-Generation-keeper-block.patch text/x-patch 4.4 KB
0004-Generation-allocChunkLimit.patch text/x-patch 3.0 KB
bench.sql application/sql 3.7 KB
