| From: | Filip Janus <fjanus(at)redhat(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me> |
| Cc: | lakshmi <lakshmigcdac(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, zsolt(dot)parragi(at)percona(dot)com |
| Subject: | Re: Proposal: Adding compression of temporary files |
| Date: | 2026-05-11 07:09:21 |
| Message-ID: | CAFjYY+K+d0PA1sGYr+vuQ__8d3y3gU3S2UMWhbX5_ZYqTrdXmA@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Tomas,
Thanks for the thorough benchmark and the script -- it was very helpful
as a starting point for my testing. I understand the results on
your machine were discouraging, and I appreciate the honest assessment.
I ran a similar benchmark on different x86_64 hardware to see how the
results change under more I/O pressure. The short version: lz4 and
zstd show significant speedups once storage or page cache becomes a
bottleneck.
Setup
-----
I used your run-hashjoins.sh as a base, with the same parameters:
100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
between runs. I also added zstd to the compression methods tested,
and tested with a larger compression block size (32 KB instead of
the default 8 KB BLCKSZ).
Two x86_64 machines:
(A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
rotational HDD (5 disks), io_uring, Fedora 43
(B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
RAM capped to 16 GB via systemd MemoryMax
Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
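To make the setup concrete, the kind of query being measured is a hash
join forced to spill by a small work_mem. The sketch below is
illustrative only -- table names, payload and the way duplicates are
generated are placeholders, not the exact statements from
run-hashjoins.sh:

  -- Illustrative sketch, not the actual benchmark script.
  -- "d" (duplicates per key) controls how compressible the spilled
  -- data is; here d = 1000 because each md5 value repeats 1000 times.
  CREATE TABLE big AS
    SELECT i AS id, md5((i % 100000)::text) AS pad
    FROM generate_series(1, 100000000) AS i;   -- ~100M rows

  CREATE TABLE probe AS
    SELECT id FROM big;

  SET work_mem = '4MB';   -- forces the hash join to batch and spill
                          -- both sides to temporary files
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT count(*)
  FROM big JOIN probe USING (id);

  -- runs differ only in the temp_file_compression setting
  -- (none / pglz / lz4 / zstd) and in COMPRESS_BLCKSZ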
Results
-------
Below are the relative timings (% of uncompressed baseline), directly
comparable to your table. Values below 100% mean compression is faster.
Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
pglz lz4
rows rep 1 4 8 1 4 8
-------------------------------------------------
10 1 661 688 300 144 148 86
10 1000 460 472 234 119 119 58
100 1 471 303 204 132 135 102
100 1000 378 262 164 107 91 81
Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
pglz lz4 zstd
rows rep 1 4 8 1 4 8 1 4 8
----------------------------------------------------------------
100 1 200 119 69 91 82 67 80 50 35
100 10 204 101 70 91 64 66 83 44 39
100 100 220 104 72 94 75 69 85 50 34
100 1000 170 92 54 79 58 52 74 42 28
Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
pglz lz4 zstd
rows rep 1 4 8 1 4 8 1 4 8
----------------------------------------------------------------
100 1 284 103 79 92 81 82 98 59 53
100 10 262 99 77 92 80 85 96 57 50
100 100 221 89 67 80 70 64 85 49 44
100 1000 155 51 42 72 39 39 77 27 29
Analysis
--------
I think the key difference is page cache pressure. Your machine has
64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
portion stays cached and synchronous I/O to storage is limited.
On our machines, I/O is a real bottleneck:
- Machine A: rotational HDD with 8 concurrent streams
- Machine B: SATA SSD but only 16 GB RAM, so the page cache
cannot absorb 8 x 12 GB of temp data
Under these conditions, reducing the bytes written translates
directly into wall-clock savings.
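As a quick way to tell whether a given system is in this I/O-bound
regime, the cumulative temp file volume is visible in the standard
pg_stat_database view (nothing specific to this patch):

  -- Temp file activity since the last stats reset; temp_bytes that is
  -- large relative to RAM suggests compression could pay off here.
  SELECT datname,
         temp_files,
         pg_size_pretty(temp_bytes) AS temp_written
  FROM pg_stat_database
  WHERE datname = current_database();

Setting log_temp_files = 0 additionally logs every temporary file with
its size, if per-query detail is needed.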
Both your results and ours confirm that pglz is simply too slow for
this use case. Your benchmark puts it at 164-688% of the uncompressed
baseline; ours shows 155-284% with w=1. Even under heavy I/O contention
(w=8 on HDD), where pglz finally beats the uncompressed baseline, it
still never matches lz4 or zstd. I would recommend against offering
pglz for temp file compression at all -- it is a trap for users who
might try it expecting reasonable performance.
lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
HDD) -- barely distinguishable from noise. Under I/O pressure it
delivers 39-52% of baseline time (2-2.5x speedup).
zstd is the most compelling option: it achieves the best compression
ratios (down to 22% of the original size on the SATA SSD) and the best
speedups (27-28% of baseline, roughly 3.5x faster), and its worst case
on x86_64 is 98% of baseline, i.e. no regression at all. I would
recommend zstd as the primary option to document, with lz4 as a
lighter-weight alternative.
Compression block size
----------------------
I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
on HDD:
COMPRESS_BLCKSZ time (% of no) compressed bytes
--------------------------------------------------------
8 KB (BLCKSZ) 58% 7.47 GB
32 KB (4*BLCKSZ) 52% 7.22 GB
64 KB (8*BLCKSZ) 56% 7.14 GB
The 8K-to-32K improvement comes from fewer compress/decompress calls
(4x fewer), less per-block header overhead, and better compression
ratios. Going to 64K shows diminishing returns and slightly worse
timings, possibly due to increased cache pressure.
Conclusion
----------
I think the data shows that the benefit of temporary file compression
depends heavily on the I/O characteristics of the system. On machines
with fast storage and ample page cache, compression is roughly neutral:
the overhead is negligible, which is a good outcome on its own. On
systems with real I/O pressure -- slower storage, limited RAM, or
concurrent workloads competing for page cache -- compression delivers
substantial speedups.
The feature does not need to be enabled by default. Compression is
controlled by the temp_file_compression GUC, which defaults to "none".
That means there is no risk of regression for existing users. But for
administrators who know their systems are I/O-constrained -- spinning
disks, limited memory, heavy concurrent spilling -- having the option
to enable lz4 or zstd can make a real difference. The data above shows
up to a 3.5x speedup in those scenarios, with no downside when the
setting is left at its default.
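For completeness, on such a system enabling it would look something
like this (assuming the GUC can be changed with a configuration reload;
the exact GUC context is of course whatever the final patch settles on):

  -- Sketch: enable zstd compression of temporary files cluster-wide.
  ALTER SYSTEM SET temp_file_compression = 'zstd';
  SELECT pg_reload_conf();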
I am attaching two PDFs visualizing my SSD/HDD results.
Full CSV results and the benchmark script are attached. Happy to run
additional tests if you have suggestions for other scenarios.
regards
-Filip-
On Wed, Mar 25, 2026 at 21:24, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Hello Filip,
>
> Thanks for the updated patch. I finally had some more time to do a
> review. I think the code looks pretty good, unfortunately the results of
> my performance validation are not very positive :-( That's not your
> fault, of course, but I'm not quite sure it can be fixed.
>
> The test I did is fairly simple - execute a hash join that spills data
> (which is a case that can be compressed), and measure how long it takes.
> And do it from multiple connections concurrently, to spill more data,
> possibly more than available RAM.
>
> The attached script runs a hashjoin query, with these parameters:
>
> * rows: 1M, 10M and 100M rows
> * duplicates: 1, 10, 100, 1000 (determines compressibility)
> * workers: 1, 4, 8 (number of connections)
> * compression: no, pglz, lz4
>
> The system has 64GB RAM, shared_buffers was set to 8GB. That leaves
> ~56GB for system and page cache. The data sizes need to spill about
> 100MB per 1M rows, so 100M rows means ~10GB of temporary files.
>
> So what behavior would be "OK" in various cases?
>
> With 1M and 10M rows, the temporary files can be kept in memory, even
> with 8 connections (we'll write ~8GB temp files in total). The kernel
> may evict some of the data to disk, but that happens in the background,
> and synchronous I/O is required. I believe the best outcome we can
> expect is probably the same duration as without compression.
>
> With 100M rows this generates >10GB of temporary files per connection.
> With 8 connections, that's >80GB, which exceeds the page cache capacity,
> and so will have to do quite a bit of I/O. In this case we expect a
> (hopefully) significant speedup, depending on how compressible the
> temporary data are (the higher the "d" value, the better).
>
> With 50% compression, we'd need to write just 40GB, which could even fit
> into page cache (and not need I/O at all).
>
> Here are the timings from the "xeon" machine, for 10M and 100M rows. The
> attached PDFs have a more complete data from another machine (with two
> types of storage). But the behavior is pretty much the same, so let's
> focus on this example:
>
> | no | pglz | lz4
> rows rep | 1 4 8 | 1 4 8 | 1 4 8
> ---------------------------------------------------------------
> 10 1 | 6 6 15 | 40 41 45 | 8 8 12
> 10 | 6 6 12 | 39 40 43 | 8 8 13
> 100 | 6 6 13 | 36 37 40 | 8 8 9
> 1000 | 6 6 13 | 27 28 30 | 7 7 7
> 100 1 | 76 136 233 | 361 413 477 | 101 184 239
> 10 | 87 143 226 | 368 398 470 | 110 157 248
> 100 | 87 128 233 | 367 402 477 | 96 169 247
> 1000 | 85 138 246 | 322 362 403 | 90 126 198
>
> If we take the "no" compression as a baseline, then the relative timings
> look like this:
>
> | pglz | lz4
> rows rep | 1 4 8 | 1 4 8
> ----------------------------------------------------------
> 10 1 | 661% 688% 300% | 144% 148% 86%
> 10 | 647% 665% 347% | 143% 145% 106%
> 100 | 599% 620% 306% | 135% 139% 74%
> 1000 | 460% 472% 234% | 119% 119% 58%
> 100 1 | 471% 303% 204% | 132% 135% 102%
> 10 | 421% 277% 208% | 127% 110% 110%
> 100 | 421% 313% 204% | 110% 132% 106%
> 1000 | 378% 262% 164% | 107% 91% 81%
>
> That's not very encouraging, unfortunately.
>
> The pglz causes a massive regression, making it ~6x slower even when
> everything fits into memory, and there's no chance for the compression
> to help. It works better for the large case, where it gets "only" 1.6x
> slower than no compression. That doesn't seem like a great deal.
>
> With lz4 we do much better, it's only ~1.4x slower, and in a couple
> cases it even beats no compression. It actually wins even with 10M rows
> and 8 connections, which is interesting. But even this seems a bit
> disappointing.
>
> The attached PDFs also show how much data was written to temporary files
> (the second chart). It's pretty consistent between pglz/lz4. It's clear
> how the "repetitions" parameter affects compressibility, although it's
> interesting it gets worse for larger data sets. I assume it's a
> consequence of how we write data to a hash table and then spill it,
> which likely "mixes up" the data a bit. But I haven't looked into the
> details, and I don't think it matters very much.
>
> Can you please review my benchmark script, and maybe try reproducing the
> results? It's entirely possible I did some silly mistake. You'll need to
> adjust a couple hard-coded paths in the script. If you don't have access
> to suitable hardware, I may be able to provide something.
>
> It's also possible we do the compression wrong in some way, making it
> much more expensive. For pglz that's unlikely, because the API is pretty
> simple. And we know pglz is a bit slow. For Lz4 there are multiple ways
> to do the compression, so maybe we're not using the right interface? Or
> maybe we could tune the compression level somehow? Not sure.
>
> It's also possible the benchmark is too simplistic. For example, maybe
> the results would be much more positive if the storage (and page cache)
> was more utilized. For example, if there was a concurrent pgbench with
> large scale, the compression might help a lot.
>
> But that's not an excuse to cause regressions for systems that have
> enough RAM / lightly utilized storage (and I assume most systems will be
> like that). I don't think a GUC is a good answer to this. If there was a
> clear class of systems that universally (and significantly) benefit from
> the compression, then maybe. But the gains seem fairly limited.
>
> I'd suggest reviewing my benchmark script and making sure I haven't made
> some silly mistake, maybe try constructing your own test. And then maybe
> check if there's a way to do the compression faster (at least for lz4
> there might be some hope). If not, we should probably cut our losses.
>
> I feel rather awful about this, mostly because I'm the one who suggested
> working on this back in 2024. Finding out after ~14 months it may not
> actually be a good idea feels pretty sad. I hope you at least learned a
> little bit about the development process, and will try again with a
> different patch ...
>
>
> regards
>
> --
> Tomas Vondra
>
| Attachment | Content-Type | Size |
|---|---|---|
| hashjoin-ssd-xeon.pdf | application/pdf | 49.9 KB |
| hashjoin-hdd-xeon.pdf | application/pdf | 50.3 KB |