From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Filip Janus <fjanus(at)redhat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Adding compression of temporary files
Date: 2025-10-01 15:53:26
Message-ID: 8c9cd489-9d46-48bb-9a8d-64f4536a2abc@vondra.me
Lists: pgsql-hackers
Hi,
On 9/30/25 14:42, Tomas Vondra wrote:
>
> v20250930-0018-undo-unncessary-changes-to-Makefile.patch
>
> - Why did the 0001 patch add this? Maybe it's something we should add
> separately, not as part of this patch?
>
I realized this bit is actually necessary to make EXTRA_TESTS work
for the lz4 regression test. The attached patch series therefore skips
this bit.
There are also experimental patches adding gzip (or rather libz) and
zstd compression. This is very rough; I just wanted to see how these
would perform compared to pglz/lz4. But I haven't done any proper
evaluation so far, beyond running a couple of simple queries. I will try
to spend a bit more time on that soon.
I still wonder about the impact of stream compression. I know it can
improve the compression ratio, but I'm not sure it also helps with
compression speed. I think for temporary files faster compression
(and a lower ratio) may be a better trade-off. So maybe we should use
lower compression levels ...
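To illustrate the level/speed trade-off, here is a minimal sketch using
Python's stdlib zlib as a stand-in for libz (lz4/zstd aren't in the
standard library, and absolute numbers differ, but the shape of the
trade-off is the same): lower levels compress faster at a worse ratio.

```python
# Sketch: lower compression levels trade ratio for speed.
# zlib is used here only as a stdlib stand-in for libz; lz4/zstd
# show the same general pattern with different absolute numbers.
import time
import zlib

# Synthetic "temp file" payload, repetitive enough to compress well.
data = b"tuple data with some repetition " * 4096

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"level={level} ratio={ratio:.1f}x time={elapsed * 1e3:.2f}ms")
```

For temp files, which are written once and read back soon after, the
fast end of that curve is likely the interesting one.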
Attached are two PDF files with results of the perf evaluation using
TPC-H 10GB and 50GB data sets. One table shows timings for the 22
queries with compression set to no/pglz/lz4, for a range of parameter
combinations (work_mem, parallel workers). The other shows the amount of
temporary file data (in MB) generated by each query.
The timing shows that pglz is pretty slow, about doubling duration for
some of the queries. That's not surprising, we know pglz can be slow.
lz4 is almost perfectly neutral, which is actually great - the goal is
to reduce I/O pressure for temporary files, but with a single query
running at a time, that's not a problem. So "no impact" is about the
best we can do, it shows the lz4 overhead is negligible.
The "size" PDF shows that compression can save a fair amount of temp
space. For many queries it saves 50-70% of the temporary space. A good
example is Q9 which (on the 50GB scale) used to take about 33GB, and
with compression it's down to ~17GB (with both pglz and lz4). That's
pretty good, I think.
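As a back-of-the-envelope check of the Q9 numbers quoted above:

```python
# Q9 at the 50GB scale: ~33GB of temp files uncompressed,
# ~17GB with pglz/lz4 compression (numbers from the PDFs).
uncompressed_gb = 33
compressed_gb = 17

savings = 1 - compressed_gb / uncompressed_gb
print(f"temp space saved: {savings:.0%}")  # prints "temp space saved: 48%"
```

So Q9 sits just below the 50-70% range seen for many of the other
queries.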
FWIW the "size" results may be a bit misleading, in that they measure
tempfile size for the whole query. But some queries may use multiple
temporary files, and some of those may not support compression (e.g.
tuplesort doesn't), which will make the actual compression ratio look
lower. OTOH it's more representative of the impact on actual queries.
regards
--
Tomas Vondra