Re: [PATCH] Compression dictionaries for JSONB

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Jacob Champion <jchampion(at)timescale(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-02-06 14:03:07
Message-ID: CAJ7c6TN0b+iBBO5yZm+Tqj-RBzuKAOppdcfvmqz0s2NVztY19Q@mail.gmail.com
Lists: pgsql-hackers

Hi,

> > So to clarify, are we talking about tuple-level compression? Or
> > perhaps page-level compression?
>
> Tuple level.
>
> What I think we should do is basically this:
>
> When we compress datums, we know the table being targeted. If there's a
> pg_attribute parameter indicating we should, we can pass a prebuilt
> dictionary to the LZ4/zstd [de]compression functions.
>
> It's possible we'd need to use a somewhat extended header for such
> compressed datums, to reference the dictionary "id" to be used when
> decompressing, if the compression algorithms don't already have that in
> one of their headers, but that's entirely doable.
>
> A quick demo of the effect size:
> [...]
> Here's the results:
>
>          lz4      zstd    uncompressed
> no dict  1328794  982497  3898498
> dict     375070   267194
>
> I'd say the effect of the dictionary is pretty impressive. And remember,
> this is with the dictionary having been trained on a subset of the data.

I see your point: training the dictionary on a sample of the data is
too beneficial to neglect. Can't argue with that.
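
For the archives: with zstd this mechanism is largely built in. Below
is a minimal sketch (not actual PostgreSQL code; error handling
omitted, and the dictionary lookup is hand-waved) of what the
datum-level flow could look like. Note that zstd frames already embed
the id of the training dictionary, retrievable with
ZSTD_getDictID_fromFrame(), so at least for zstd the extended datum
header may not even be needed:

#include <zstd.h>

/*
 * Sketch: [de]compress a datum with a prebuilt per-attribute
 * dictionary. Real code would check results with ZSTD_isError().
 */
static size_t
datum_compress(ZSTD_CCtx *cctx, void *dst, size_t dst_capacity,
               const void *src, size_t src_size,
               const void *dict, size_t dict_size)
{
    return ZSTD_compress_usingDict(cctx, dst, dst_capacity,
                                   src, src_size,
                                   dict, dict_size,
                                   ZSTD_CLEVEL_DEFAULT);
}

static size_t
datum_decompress(ZSTD_DCtx *dctx, void *dst, size_t dst_capacity,
                 const void *src, size_t src_size,
                 const void *dict, size_t dict_size)
{
    /*
     * The zstd frame header records the id of the dictionary used at
     * compression time; a real implementation would use it to fetch
     * the right dictionary (e.g. from a catalog) before this call.
     */
    unsigned dict_id = ZSTD_getDictID_fromFrame(src, src_size);
    (void) dict_id;

    return ZSTD_decompress_usingDict(dctx, dst, dst_capacity,
                                     src, src_size,
                                     dict, dict_size);
}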

What puzzles me, though, is: what prevents us from doing this at the
page level, as suggested previously?

The more similar data you compress together, the more space and disk
I/O you save. Additionally, you don't have to compress/decompress the
data every time you access it: everything in shared buffers stays
uncompressed. Not to mention that you don't have to care what's in
pg_attribute, that the schema may change, etc. There is a table, and
there is a dictionary for this table that you refresh from time to
time. Very simple.
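
To illustrate, here is roughly what I have in mind. A sketch only, in
the same hedged spirit as above: the catalog plumbing is hand-waved,
and training uses zstd's ZDICT API:

#include <zstd.h>
#include <zdict.h>      /* ZDICT_trainFromBuffer() */

#define BLCKSZ 8192     /* PostgreSQL's default page size */

/*
 * Periodically: refresh the table's dictionary from a sample of its
 * pages. 'samples' is the concatenation of 'npages' pages; 'sizes'
 * holds the size of each sample (BLCKSZ here).
 */
static size_t
refresh_table_dict(void *dict, size_t dict_capacity,
                   const void *samples, const size_t *sizes,
                   unsigned npages)
{
    return ZDICT_trainFromBuffer(dict, dict_capacity,
                                 samples, sizes, npages);
}

/*
 * On write-out: compress the whole page with the table's dictionary.
 * Pages in shared buffers stay uncompressed, so reads served from
 * shared buffers pay no decompression cost.
 */
static size_t
compress_page(ZSTD_CCtx *cctx, void *dst, size_t dst_capacity,
              const void *page, const void *dict, size_t dict_size)
{
    return ZSTD_compress_usingDict(cctx, dst, dst_capacity,
                                   page, BLCKSZ,
                                   dict, dict_size,
                                   ZSTD_CLEVEL_DEFAULT);
}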

Of course, the disadvantage here is that we don't save memory, unlike
with tuple-level compression. But we save a lot of CPU cycles and do
fewer disk I/Os. I would argue that saving CPU cycles is generally
preferable: CPUs are still often the bottleneck, while memory becomes
more and more available, e.g. there are relatively affordable (for a
company, not an individual) 1 TB RAM instances, etc.

So it seems to me that doing page-level compression would be simpler
and more beneficial in the long run (10+ years). Don't you agree?

--
Best regards,
Aleksander Alekseev
