Re: Proposal: Adding compression of temporary files

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Filip Janus <fjanus(at)redhat(dot)com>, lakshmi <lakshmigcdac(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, zsolt(dot)parragi(at)percona(dot)com
Subject: Re: Proposal: Adding compression of temporary files
Date: 2026-03-25 20:23:54
Message-ID: 1565779c-f3c7-4a81-9834-ad569d7ea19f@vondra.me
Lists: pgsql-hackers

Hello Filip,

Thanks for the updated patch. I finally had some more time to do a
review. I think the code looks pretty good; unfortunately, the results
of my performance validation are not very positive :-( That's not your
fault, of course, but I'm not quite sure it can be fixed.

The test I did is fairly simple - execute a hash join that spills data
(which is a case that can be compressed), and measure how long it takes.
And do it from multiple connections concurrently, to spill more data,
possibly more than available RAM.

The attached script runs a hashjoin query, with these parameters:

* rows: 1M, 10M and 100M rows
* duplicates: 1, 10, 100, 1000 (determines compressibility)
* workers: 1, 4, 8 (number of connections)
* compression: no, pglz, lz4

The system has 64GB RAM, shared_buffers was set to 8GB. That leaves
~56GB for system and page cache. The queries spill about 100MB per 1M
rows, so 100M rows means ~10GB of temporary files.
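For reference, the parameter grid above works out to 108 distinct runs. A quick sketch of the matrix (the variable names here are mine, mirroring the list above, not necessarily the attached script's actual variables):

```python
from itertools import product

# Parameter grid from the benchmark description above.
rows = [1_000_000, 10_000_000, 100_000_000]
duplicates = [1, 10, 100, 1000]     # determines compressibility
workers = [1, 4, 8]                 # number of concurrent connections
compression = ["no", "pglz", "lz4"]

configs = list(product(rows, duplicates, workers, compression))
print(len(configs))  # 3 * 4 * 3 * 3 = 108
```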

So what behavior would be "OK" in various cases?

With 1M and 10M rows, the temporary files can be kept in memory, even
with 8 connections (we'll write ~8GB temp files in total). The kernel
may evict some of the data to disk, but that happens in the background,
and no synchronous I/O is required. I believe the best outcome we can
expect is probably the same duration as without compression.

With 100M rows this generates >10GB of temporary files per connection.
With 8 connections, that's >80GB, which exceeds the page cache capacity,
and so this will have to do quite a bit of I/O. In this case we expect
a (hopefully) significant speedup, depending on how compressible the
temporary data are (the higher the "d" value, the better).

With 50% compression, we'd need to write just 40GB, which could even fit
into page cache (and not need I/O at all).
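The sizing arithmetic above can be sketched as a back-of-the-envelope check, using the ~100MB-per-1M-rows spill figure and the ~56GB page cache estimate from earlier (the helper function is mine, purely illustrative):

```python
# Rough temp-file footprint, assuming ~100 MB of spill per 1M rows
# (the estimate given above) and ~56 GB of usable page cache.
PAGE_CACHE_GB = 56
SPILL_GB_PER_1M_ROWS = 0.1

def temp_footprint_gb(rows_millions, connections, compression_ratio=1.0):
    """Total temp-file size across all connections, in GB."""
    return rows_millions * SPILL_GB_PER_1M_ROWS * connections * compression_ratio

# 100M rows, 8 connections, no compression: ~80 GB, exceeds the cache.
print(temp_footprint_gb(100, 8))       # 80.0
# Same workload at 50% compression: ~40 GB, fits in the page cache.
print(temp_footprint_gb(100, 8, 0.5))  # 40.0
```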

Here are the timings from the "xeon" machine, for 10M and 100M rows. The
attached PDFs have more complete data from another machine (with two
types of storage). But the behavior is pretty much the same, so let's
focus on this example:

| no | pglz | lz4
rows rep | 1 4 8 | 1 4 8 | 1 4 8
---------------------------------------------------------------
10 1 | 6 6 15 | 40 41 45 | 8 8 12
10 | 6 6 12 | 39 40 43 | 8 8 13
100 | 6 6 13 | 36 37 40 | 8 8 9
1000 | 6 6 13 | 27 28 30 | 7 7 7
100 1 | 76 136 233 | 361 413 477 | 101 184 239
10 | 87 143 226 | 368 398 470 | 110 157 248
100 | 87 128 233 | 367 402 477 | 96 169 247
1000 | 85 138 246 | 322 362 403 | 90 126 198

If we take the "no" compression as a baseline, then the relative timings
look like this:

| pglz | lz4
rows rep | 1 4 8 | 1 4 8
----------------------------------------------------------
10 1 | 661% 688% 300% | 144% 148% 86%
10 | 647% 665% 347% | 143% 145% 106%
100 | 599% 620% 306% | 135% 139% 74%
1000 | 460% 472% 234% | 119% 119% 58%
100 1 | 471% 303% 204% | 132% 135% 102%
10 | 421% 277% 208% | 127% 110% 110%
100 | 421% 313% 204% | 110% 132% 106%
1000 | 378% 262% 164% | 107% 91% 81%
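The relative numbers are simply the compressed-case time divided by the baseline; a sketch of the computation (ratios recomputed from the rounded timings printed above differ by a few points from the percentage table, which was presumably built from unrounded measurements):

```python
# Absolute timings (seconds) for 10M rows, 1 repetition, taken from
# the first table above, keyed by worker count.
baseline = {1: 6, 4: 6, 8: 15}   # no compression
pglz     = {1: 40, 4: 41, 8: 45}

# Relative timing: compressed / baseline, as a percentage.
relative = {w: round(100 * pglz[w] / baseline[w]) for w in baseline}
print(relative)  # {1: 667, 4: 683, 8: 300}
```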

That's not very encouraging, unfortunately.

The pglz causes a massive regression, making it ~6x slower even when
everything fits into memory, and there's no chance for the compression
to help. It works better for the large case, where it gets "only" 1.6x
slower than no compression. That doesn't seem like a great deal.

With lz4 we do much better: it's only ~1.4x slower, and in a couple of
cases it even beats no compression. It actually wins even with 10M rows
and 8 connections, which is interesting. But even this seems a bit
disappointing.

The attached PDFs also show how much data was written to temporary files
(the second chart). It's pretty consistent between pglz/lz4. It's clear
how the "repetitions" parameter affects compressibility, although it's
interesting that it gets worse for larger data sets. I assume it's a
consequence of how we write data to a hash table and then spill it,
which likely "mixes up" the data a bit. But I haven't looked into the
details, and I don't think it matters very much.

Can you please review my benchmark script, and maybe try reproducing the
results? It's entirely possible I made some silly mistake. You'll need
adjust a couple hard-coded paths in the script. If you don't have access
to suitable hardware, I may be able to provide something.

It's also possible we do the compression wrong in some way, making it
much more expensive. For pglz that's unlikely, because the API is pretty
simple. And we know pglz is a bit slow. For LZ4 there are multiple ways
to do the compression, so maybe we're not using the right interface? Or
maybe we could tune the compression level somehow? Not sure.
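To illustrate the kind of knob that might be worth tuning: lz4 isn't in the Python stdlib, so the sketch below uses zlib purely as a stand-in to show the speed/ratio trade-off that a compression level controls. LZ4 exposes analogous knobs (the acceleration factor in LZ4_compress_fast, or the levels in LZ4HC and the frame API); whether the patch can benefit from them is exactly the open question.

```python
import zlib

# zlib as a stand-in for lz4 (which has no stdlib binding): higher
# levels spend more CPU searching for matches in exchange for a
# (usually) better compression ratio.
payload = b"some fairly repetitive temp-file data " * 10_000

fast = zlib.compress(payload, level=1)  # cheapest, worst ratio
best = zlib.compress(payload, level=9)  # most expensive, best ratio

# Both must round-trip; the higher level should not compress worse
# on repetitive data like this.
assert zlib.decompress(fast) == payload
assert zlib.decompress(best) == payload
print(len(payload), len(fast), len(best))
```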

It's also possible the benchmark is too simplistic. For example, maybe
the results would be much more positive if the storage (and page cache)
was more utilized. For example, if there was a concurrent pgbench with
large scale, the compression might help a lot.

But that's not an excuse to cause regressions for systems that have
enough RAM / lightly utilized storage (and I assume most systems will be
like that). I don't think a GUC is a good answer to this. If there was a
clear class of systems that universally (and significantly) benefit from
the compression, then maybe. But the gains seem fairly limited.

I'd suggest reviewing my benchmark script and making sure I haven't made
some silly mistake, maybe try constructing your own test. And then maybe
check if there's a way to do the compression faster (at least for lz4
there might be some hope). If not, we should probably cut our losses.

I feel rather awful about this, mostly because I'm the one who suggested
working on this back in 2024. Finding out after ~14 months it may not
actually be a good idea feels pretty sad. I hope you at least learned a
little bit about the development process, and will try again with a
different patch ...

regards

--
Tomas Vondra

Attachment Content-Type Size
hashjoin-ryzen.pdf application/pdf 57.5 KB
hashjoin-xeon.pdf application/pdf 47.8 KB
run-hashjoins.sh application/x-shellscript 3.5 KB
