Re: [HACKERS] Custom compression methods

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Nikita Glukhov <n(dot)gluhov(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-12-03 02:52:13
Message-ID: CAF4Au4z-7_Ya4qO=PBK37dyY-Up+CqT0v_ucqXUV8Zanu4eNLw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Dec 2, 2017 at 6:04 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> On 12/01/2017 10:52 PM, Andres Freund wrote:
>> On 2017-12-01 16:14:58 -0500, Robert Haas wrote:
>>> Honestly, if we can give everybody a 4% space reduction by
>>> switching to lz4, I think that's totally worth doing -- but let's
>>> not make people choose it, let's make it the default going forward,
>>> and keep pglz support around so we don't break pg_upgrade
>>> compatibility (and so people can continue to choose it if for some
>>> reason it works better in their use case). That kind of improvement
>>> is nothing special in a specific workload, but TOAST is a pretty
>>> general-purpose mechanism. I have become, through a few bitter
>>> experiences, a strong believer in the value of trying to reduce our
>>> on-disk footprint, and knocking 4% off the size of every TOAST
>>> table in the world does not sound worthless to me -- even though
>>> context-aware compression can doubtless do a lot better.
>>
>> +1. It's also a lot faster, and I've seen way way to many workloads
>> with 50%+ time spent in pglz.
>>
>
> TBH the 4% figure is something I mostly made up (I'm fake news!). On the
> mailing list archive (which I believe is pretty compressible) I observed
> something like 2.5% size reduction with lz4 compared to pglz, at least
> with the compression levels I've used ...

Nikita Glukhove tested compression on real json data:

Delicious bookmarks (js):

json 1322MB
jsonb 1369MB
jsonbc 931MB 1.5x
jsonb+lz4d 404MB 3.4x

Citus customer reviews (jr):

json 1391MB
jsonb 1574MB
jsonbc 622MB 2.5x
jsonb+lz4d 601MB 2.5x

I also attached a plot with wired tiger size (zstd compression) in Mongodb.
Nikita has more numbers about compression.

>
> Other algorithms (e.g. zstd) got significantly better compression (25%)
> compared to pglz, but in exchange for longer compression. I'm sure we
> could lower compression level to make it faster, but that will of course
> hurt the compression ratio.
>
> I don't think switching to a different compression algorithm is a way
> forward - it was proposed and explored repeatedly in the past, and every
> time it failed for a number of reasons, most of which are still valid.
>
>
> Firstly, it's going to be quite hard (or perhaps impossible) to find an
> algorithm that is "universally better" than pglz. Some algorithms do
> work better for text documents, some for binary blobs, etc. I don't
> think there's a win-win option.
>
> Sure, there are workloads where pglz performs poorly (I've seen such
> cases too), but IMHO that's more an argument for the custom compression
> method approach. pglz gives you good default compression in most cases,
> and you can change it for columns where it matters, and where a
> different space/time trade-off makes sense.
>
>
> Secondly, all the previous attempts ran into some legal issues, i.e.
> licensing and/or patents. Maybe the situation changed since then (no
> idea, haven't looked into that), but in the past the "pluggable"
> approach was proposed as a way to address this.

I don't think so. Pluggable means that now we have more data types,
which don't fit to
the old compression scheme of TOAST and we need better flexibility. I
see in future we
could avoid decompression of the whole toast just to get on key from
document, so we
first slice data and compress each slice separately.

>
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Attachment Content-Type Size
Screen Shot 2017-12-03 at 11.46.14.png image/png 183.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message chenhj 2017-12-03 03:27:57 Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
Previous Message Jeff Janes 2017-12-03 01:27:51 Re: Bitmap scan is undercosted?