Re: pglz performance

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Vladimir Leskov <vladimirlesk(at)yandex-team(dot)ru>
Subject: Re: pglz performance
Date: 2019-08-04 09:57:04
Message-ID: CF6BA10B-E36D-4489-BF2B-25F9012ED3CA@yandex-team.ru
Lists: pgsql-hackers

> On 2 Aug 2019, at 21:39, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>> We have some kind of "roadmap" for "extensible pglz". We plan to provide an implementation at the November CF.
>
> I don't understand why it's a good idea to improve the compression side
> of pglz. There are plenty of other people who have spent a lot of time
> developing better compression algorithms.
Improving the compression side of pglz amounts to two different projects:
1. Faster compression with less code and the same compression ratio (the patch in this thread).
2. Better compression ratio with at least the same compression speed on uncompressed values.
Why do I want to do a patch for 2? Because it's interesting.
Will 1 or 2 be reviewed or committed? I have no idea.
Will many users benefit from 1 or 2? Yes, clearly, unless we force everyone to stop compressing with pglz.

>> Currently, pglz starts with an empty cache map: there are no prior 4KB of data before the start. We could prepend an imaginary prefix to any data with common substrings: this would enhance the compression ratio.
>> It is hard to decide on a training data set for this "common prefix", so we want to produce an extension with an aggregate function that builds an "adapted common prefix" from the user's data.
>> Then we could "reserve" a few negative bytes for "decompression commands". Such a command would instruct the database which common prefix to use.
>> But a system command could also say "invoke decompression from the extension".
>>
>> Thus, the user would be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly.
>>
>> This would make a hard-chosen compression algorithm unnecessary, but it seems overly hacky. On the other hand, there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot.
>
> I think this is way too complicated, and will not provide particularly
> much benefit for the majority of users.
>
> In fact, I'll argue that we should flat out reject any such patch until
> we have at least one decent default compression algorithm in
> core. You're trying to work around a poor compression algorithm with
> complicated dictionary improvements
OK. The idea of something pluggable into pglz seemed odd even to me.
But it looks like it has restarted the lz4 discussion :)

> that require user interaction, will only work in a relatively small
> subset of cases, and will very often increase compression times.
No. Certainly, if the "common prefix" implementation increases compression times, I will not even post a patch.
BTW, lz4 also supports a "common prefix"; shall we do that too?
Here is a link about the Zstd dictionary builder; it is compatible with lz4 too:
https://github.com/facebook/zstd#the-case-for-small-data-compression
We actually have small datums.
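To sketch what this could look like (an untested illustration only; the helper names and the 4KB dictionary capacity below are mine, not from any patch), we could train a dictionary with libzstd's ZDICT API and load it as the lz4 prefix:

#include <lz4.h>
#include <zdict.h>

#define DICT_CAPACITY (4 * 1024)    /* same order as pglz's history window */

static char   dict_buf[DICT_CAPACITY];
static size_t dict_size;

/* Build an "adapted common prefix" from concatenated sample datums. */
static int
train_dict(const void *samples, const size_t *sample_sizes, unsigned nsamples)
{
    dict_size = ZDICT_trainFromBuffer(dict_buf, sizeof(dict_buf),
                                      samples, sample_sizes, nsamples);
    return ZDICT_isError(dict_size) ? -1 : 0;
}

/* Compress one datum with the trained dictionary as the lz4 "common prefix". */
static int
compress_with_prefix(const char *src, int srclen, char *dst, int dstcap)
{
    LZ4_stream_t *stream = LZ4_createStream();
    int           n;

    LZ4_loadDict(stream, dict_buf, (int) dict_size);
    n = LZ4_compress_fast_continue(stream, src, dst, srclen, dstcap, 1);
    LZ4_freeStream(stream);
    return n;                   /* 0 means failure, otherwise compressed size */
}

Decompression would use LZ4_decompress_safe_usingDict() with the same buffer, which is exactly why a trained prefix would have to reach standbys too.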

> On 4 Aug 2019, at 05:41, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>
> Just so that we don't idly talk, what do you think about the attached?
> It:
> - adds a new GUC compression_algorithm with possible values pglz (default) and lz4 (if lz4 is compiled in); requires SIGHUP
> - adds a --with-lz4 configure option (default yes, so the option is actually --without-lz4) that enables lz4 using the system library
> - uses compression_algorithm for both TOAST and WAL compression (if on)
> - supports slicing for lz4 as well (pglz slicing was already supported)
> - supports reading old TOAST values
> - adds a 1-byte header to the compressed data in which the algorithm kind is stored, which leaves us 254 more to add :) (that's extra overhead compared to the current state)
> - changes the rawsize in the TOAST header to 31 bits via bit packing
> - uses the extra bit to differentiate between the old and new formats
> - supports reading from a table which has rows stored with different algorithms (so that the GUC itself can be freely changed)
That's cool. I suggest defaulting to lz4 whenever it is available. Note that a cluster which has used lz4 even once cannot be started with non-lz4 binaries.
Do we plan for the possibility of compression algorithms as extensions? Or will all the algorithms packed into that byte live in core?
What about an lz4 "common prefix", system- or user-defined? If lz4 is compiled in, we could even offer in-system training; we would just have to make sure that trained prefixes make their way to standbys.
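To make the header part concrete, here is how I read the new on-disk format from your description (the struct, macro names and exact bit placement below are my guesses, not code from the patch; real code would also have to avoid struct padding):

#include <stdint.h>

/* 31-bit rawsize plus one flag bit; new-format values carry one extra
   byte identifying the algorithm, leaving 254 ids for the future. */
typedef struct
{
    uint32_t    info;   /* bit 31: new-format flag; bits 0-30: rawsize */
    uint8_t     algo;   /* 0 = pglz, 1 = lz4, ... (new format only) */
} toast_compress_header_ext;

#define TOAST_COMPRESS_NEW_FORMAT   0x80000000U
#define TOAST_COMPRESS_RAWSIZE(h)   ((h)->info & 0x7FFFFFFFU)
#define TOAST_COMPRESS_IS_NEW(h)    (((h)->info & TOAST_COMPRESS_NEW_FORMAT) != 0)

Switching a cluster over would then presumably be just

    compression_algorithm = 'lz4'   # pglz | lz4; SIGHUP is enough

in postgresql.conf.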

Best regards, Andrey Borodin.
