Re: pglz performance

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Vladimir Leskov <vladimirlesk(at)yandex-team(dot)ru>
Subject: Re: pglz performance
Date: 2019-08-02 15:40:51
Message-ID: DBB2A9E5-29FD-40BF-AC60-BD990FBF142F@yandex-team.ru
Lists: pgsql-hackers

Thanks for looking into this!

> On 2 Aug 2019, at 19:43, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Fri, Aug 02, 2019 at 04:45:43PM +0300, Konstantin Knizhnik wrote:
>>
>> It takes me some time to understand that your memcpy optimization is correct;)
Seems the comments are not explanatory enough... I will try to fix that.

>> I have tested different ways of optimizing this fragment of code, but failed to outperform your implementation!
JFYI, we tried other optimizations too: memcpy with a constant size (so the compiler emits inline instructions instead of a call), unrolling the literal-copy loop, and some others. None of these performed better.

>> But ... below are results for lz4:
>>
>> Decompressor score (summ of all times):
>> NOTICE: Decompressor lz4_decompress result 3.660066
>> Compressor score (summ of all times):
>> NOTICE: Compressor lz4_compress result 10.288594
>>
>> There is 2 times advantage in decompress speed and 10 times advantage in compress speed.
>> So may be instead of "hacking" pglz algorithm we should better switch to lz4?
>>
>
> I think we should just bite the bullet and add initdb option to pick
> compression algorithm. That's been discussed repeatedly, but we never
> ended up actually doing that. See for example [1].
>
> If there's anyone willing to put some effort into getting this feature
> over the line, I'm willing to do reviews & commit. It's a seemingly
> small change with rather insane potential impact.
>
> But even if we end up doing that, it still makes sense to optimize the
> hell out of pglz, because existing systems will still use that
> (pg_upgrade can't switch from one compression algorithm to another).

We have a rough roadmap for an "extensible pglz". We plan to provide an implementation for the November CF.

Currently, pglz starts with an empty cache map: there are no prior 4 KB of history before the start of the data. We can prepend an imaginary prefix containing common substrings to any data; this will improve the compression ratio.
It is hard to pick a training data set for this "common prefix", so we want to provide an extension with an aggregate function that produces an "adapted common prefix" from the user's data.
We can then "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database which common prefix to use.
A system command could also say "invoke decompression from an extension".

Thus, users will be able to train database compression on their own data and seamlessly substitute a custom compression method for pglz.

This would make a hard-wired choice of compression algorithm unnecessary, though it may seem overly hacky. But then there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for their foot.

Best regards, Andrey Borodin.
