| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Filip Janus <fjanus(at)redhat(dot)com> |
| Cc: | lakshmi <lakshmigcdac(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, zsolt(dot)parragi(at)percona(dot)com |
| Subject: | Re: Proposal: Adding compression of temporary files |
| Date: | 2026-05-12 14:13:57 |
| Message-ID: | a5d90fde-a68a-48ba-ba8d-d2717cb10396@vondra.me |
| Lists: | pgsql-hackers |
On 5/11/26 09:09, Filip Janus wrote:
>
>
> Hi Tomas,
>
> Thanks for the thorough benchmark and the script -- it was very helpful
> as a starting point for my testing. I understand the results on
> your machine were discouraging, and I appreciate the honest assessment.
>
> I ran a similar benchmark on different x86_64 hardware to see how the
> results change under more I/O pressure. The short version: lz4 and
> zstd show significant speedups once storage or page cache becomes a
> bottleneck.
>
I'm glad you didn't just give up and decided to run some more tests.
> Setup
> -----
>
> I used your run-hashjoins.sh as a base, with the same parameters:
> 100M rows, d in {1, 10, 100, 1000}, w in {1, 4, 8}, drop-caches
> between runs. I also added zstd to the compression methods tested,
> and tested with a larger compression block size (32 KB instead of
> the default 8 KB BLCKSZ).
>
> Two x86_64 machines:
>
> (A) HPE BL460c Gen10, 2x Xeon Gold 6148, 64 GB RAM,
> rotational HDD (5 disks), io_uring, Fedora 43
>
> (B) Dell MX840c, Xeon Gold 6148, SATA SSD (~224 GB),
> RAM capped to 16 GB via systemd MemoryMax
>
> Both use 32 KB compression blocks (COMPRESS_BLCKSZ = 4*BLCKSZ).
>
What is COMPRESS_BLCKSZ? I don't see that in the patch anywhere. What am
I missing?
> Results
> -------
>
> Below are the relative timings (% of uncompressed baseline), directly
> comparable to your table. Values below 100% mean compression is faster.
>
> Your results (Xeon, 64 GB, SSD/NVMe, 8 KB blocks):
>
> pglz lz4
> rows rep 1 4 8 1 4 8
> -------------------------------------------------
> 10 1 661 688 300 144 148 86
> 10 1000 460 472 234 119 119 58
> 100 1 471 303 204 132 135 102
> 100 1000 378 262 164 107 91 81
>
> Our results, machine A -- x86 HDD, 64 GB, 32 KB blocks:
>
> pglz lz4 zstd
> rows rep 1 4 8 1 4 8 1 4 8
> ----------------------------------------------------------------
> 100 1 200 119 69 91 82 67 80 50 35
> 100 10 204 101 70 91 64 66 83 44 39
> 100 100 220 104 72 94 75 69 85 50 34
> 100 1000 170 92 54 79 58 52 74 42 28
>
> Our results, machine B -- x86 SATA SSD, 16 GB cap, 32 KB blocks:
>
> pglz lz4 zstd
> rows rep 1 4 8 1 4 8 1 4 8
> ----------------------------------------------------------------
> 100 1 284 103 79 92 81 82 98 59 53
> 100 10 262 99 77 92 80 85 96 57 50
> 100 100 221 89 67 80 70 64 85 49 44
> 100 1000 155 51 42 72 39 39 77 27 29
>
> Analysis
> --------
>
> I think the key difference is page cache pressure. Your machine has
> 64 GB RAM with 8 GB shared_buffers, leaving ~56 GB for the OS page
> cache. Even with 8 connections x ~10 GB temp files = ~80 GB, a large
> portion stays cached and synchronous I/O to storage is limited.
>
> On our machines, I/O is a real bottleneck:
> - Machine A: rotational HDD with 8 concurrent streams
> - Machine B: SATA SSD but only 16 GB RAM, so the page cache
> cannot absorb 8 x 12 GB of temp data
>
> Under these conditions, reducing the bytes written translates
> directly into wall-clock savings.
>
Seems like that. It's not a huge surprise that this matters more on
systems with memory pressure and slower storage. I should have tested
that on my machines too.
I was going to question how common such systems are nowadays, when
people can just spin up a VM with plenty of RAM and SSDs. But given the
current RAM shortage / costs, and relatively slow network storage (even
if temporary files can use ephemeral disks), maybe it's not all that
uncommon ...
> Both your results and ours confirm that pglz is simply too slow for
> this use case. Your benchmark shows 164-688% overhead; ours shows
> 155-284% with w=1. Even under heavy I/O contention (w=8 on HDD)
> where pglz eventually wins, it never outperforms lz4 or zstd. I
> would recommend against offering pglz for temp file compression
> altogether -- it creates a trap for users who might try it expecting
> reasonable performance.
>
Right.
> lz4 looks safe: the worst case in our data is 94% (w=1, d=100 on
> HDD) -- barely distinguishable from noise. Under I/O pressure it
> delivers 39-52% of baseline time (2-2.5x speedup).
>
> zstd is the most compelling option: it achieves the best compression
> ratios (down to 22% of original size on the SATA SSD) and the best
> speedups (27-28% of baseline = 3.5x faster), with no regression
> exceeding 98% on x86_64. I would recommend zstd as the primary
> option to document, with lz4 as a lighter-weight alternative.
>
Agreed. lz4 seems safe, zstd is good too. I wonder how much this depends
on the particular data set (e.g. if we generated the data differently,
how much would that affect the results).
> Compression block size
> ----------------------
>
> I also tested 8 KB, 32 KB, and 64 KB compression block sizes.
> 32 KB appears to be the sweet spot. Example for lz4, d=1000, w=8
> on HDD:
>
> COMPRESS_BLCKSZ time (% of no) compressed bytes
> --------------------------------------------------------
> 8 KB (BLCKSZ) 58% 7.47 GB
> 32 KB (4*BLCKSZ) 52% 7.22 GB
> 64 KB (8*BLCKSZ) 56% 7.14 GB
>
> The 8K-to-32K improvement comes from fewer compress/decompress calls
> (4x fewer), less per-block header overhead, and better compression
> ratios. Going to 64K shows diminishing returns and slightly worse
> timings, possibly due to increased cache pressure.
>
I'm still not quite sure what "compression block size" means here, or
how you changed it.
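For what it's worth, here is my reading of the idea as a toy sketch: the temp stream gets compressed in fixed-size chunks, so a larger chunk means fewer compressor calls and slightly less per-chunk overhead. This uses zlib from the stdlib purely as a stand-in for lz4/zstd, and the chunking is my own illustration, not the patch's actual code:

```python
import zlib

# Toy model of per-chunk temp-file compression: split the stream into
# fixed-size chunks and compress each one independently (as a temp file
# must, if blocks are to remain individually addressable). zlib is only
# a stand-in here for lz4/zstd; the patch's real implementation differs.

def chunked_compress(data: bytes, chunk_size: int) -> tuple[int, int]:
    """Return (number of compressor calls, total compressed bytes)."""
    calls = 0
    total = 0
    for off in range(0, len(data), chunk_size):
        calls += 1
        total += len(zlib.compress(data[off:off + chunk_size]))
    return calls, total

# Mildly repetitive payload, vaguely like spilled tuples:
data = b"".join(b"row-%08d|some padding value\n" % (i % 1000)
                for i in range(100_000))

for size in (8 * 1024, 32 * 1024, 64 * 1024):
    calls, total = chunked_compress(data, size)
    print(f"{size // 1024:>3} KB chunks: {calls:>4} calls, {total} compressed bytes")
```

Going from 8 KB to 32 KB chunks cuts the number of compressor calls by 4x and shaves per-chunk header/warm-up overhead, which matches the direction of the numbers above, even if the magnitudes obviously depend on the codec and data.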
> Conclusion
> ----------
>
> I think the data shows that the benefit of temporary file compression
> depends heavily on the I/O characteristics of the system. On machines
> with fast storage and ample page cache, compression is neutral -- it
> adds negligible overhead, which is a good outcome on its own. On
> systems with real I/O pressure -- slower storage, limited RAM, or
> concurrent workloads competing for page cache -- compression delivers
> substantial speedups.
>
True.
> The feature does not need to be enabled by default. Compression is
> controlled by the temp_file_compression GUC, which defaults to "none".
> That means there is no risk of regression for existing users. But for
> administrators who know their systems are I/O-constrained -- spinning
> disks, limited memory, heavy concurrent spilling -- having the option
> to enable lz4 or zstd can make a real difference. The data above shows
> up to 3.5x speedup in those scenarios, with no
> downside when the setting is left at its default.
>
Yes, having it as opt-in for systems where it matters helps.
What bothers me a little bit is that systems are generally not under
such pressure 24/7, but only for part of the day. Yet people will
mostly set the GUC in the config file. I don't have a better solution
to this, though.
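One partial mitigation, assuming the GUC ends up user-settable (I have not checked the GucContext in the patch, so treat this as hypothetical): it could be enabled only for the roles or sessions that actually do the heavy spilling, rather than globally, e.g.:

```sql
-- Hypothetical usage, assuming temp_file_compression is user-settable
-- (not verified against the patch). Leave the global default at 'none'
-- in postgresql.conf, then enable it only where the spilling happens:
ALTER ROLE reporting_user SET temp_file_compression = 'lz4';

-- or just for one heavy session:
SET temp_file_compression = 'lz4';
```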
FYI I won't be able to do much work on this until ~June.
regards
--
Tomas Vondra