Re: [PATCH] Compression dictionaries for JSONB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Jacob Champion <jchampion(at)timescale(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-02-05 19:06:02
Message-ID: 20230205190602.m7kgendxow6j2a3y@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-02-05 20:05:51 +0300, Aleksander Alekseev wrote:
> > I don't think we'd want much of the infrastructure introduced in the
> > patch for type agnostic cross-row compression. A dedicated "dictionary"
> > type as a wrapper around other types IMO is the wrong direction. This
> > should be a relation-level optimization option, possibly automatic, not
> > something visible to every user of the table.
>
> So to clarify, are we talking about tuple-level compression? Or
> perhaps page-level compression?

Tuple level.

What I think we should do is basically this:

When we compress datums, we know the table being targeted. If there's a
pg_attribute-level parameter indicating that we should, we can pass a
prebuilt dictionary to the LZ4/zstd [de]compression functions.

We might need a somewhat extended header for such compressed datums that
references the "id" of the dictionary to use when decompressing, in case the
compression algorithms don't already carry that in one of their own headers,
but that's entirely doable.
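
To make that a bit more concrete, here's a rough sketch of what the
compression side could look like using zstd's dictionary API. To be clear,
this isn't a patch: the struct, field and function names
(DictCompressedHeader, dict_id, compress_datum_with_dict) are made up for
illustration, only the ZSTD_* calls are real:

#include <stdint.h>
#include <zstd.h>

/*
 * Hypothetical extended header for a dictionary-compressed datum.  On top
 * of what the existing compressed-datum header carries, it records which
 * dictionary was used, so decompression knows what to load.
 */
typedef struct DictCompressedHeader
{
    uint32_t    rawsize;    /* uncompressed datum size */
    uint32_t    dict_id;    /* identifies the prebuilt per-attribute dictionary */
} DictCompressedHeader;

/*
 * Compress one datum with a dictionary trained ahead of time (e.g. with
 * "zstd --train") for this attribute.  The ZSTD_CDict would be built once
 * from the stored dictionary bytes and cached, not rebuilt per datum.
 */
static size_t
compress_datum_with_dict(ZSTD_CCtx *cctx, const ZSTD_CDict *cdict,
                         const char *src, size_t srclen,
                         char *dst, size_t dstcap)
{
    size_t      ret;

    ret = ZSTD_compress_usingCDict(cctx, dst, dstcap, src, srclen, cdict);
    if (ZSTD_isError(ret))
        return 0;               /* caller stores the datum uncompressed instead */
    return ret;
}

Decompression would do the mirror image with
ZSTD_createDDict()/ZSTD_decompress_usingDDict(), after resolving dict_id
from the header back to the dictionary bytes.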

A quick demo of the effect size:

# create data to train dictionary with, use subset to increase realism
mkdir /tmp/pg_proc_as_json/;
CREATE EXTENSION adminpack;
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true)
FROM pg_proc pp
LIMIT 2000;

# build dictionary
zstd --train -o /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

# create more data
SELECT pg_file_write('/tmp/pg_proc_as_json/'||oid||'.raw', to_json(pp)::text, true) FROM pg_proc pp;

# compress without dictionary
lz4 -k -m /tmp/pg_proc_as_json/*.raw
zstd -k /tmp/pg_proc_as_json/*.raw

# measure size
cat /tmp/pg_proc_as_json/*.raw|wc -c; cat /tmp/pg_proc_as_json/*.lz4|wc -c; cat /tmp/pg_proc_as_json/*.zst|wc -c

# compress with dictionary
rm -f /tmp/pg_proc_as_json/*.{lz4,zst};
lz4 -k -D /tmp/pg_proc_as_json.dict -m /tmp/pg_proc_as_json/*.raw
zstd -k -D /tmp/pg_proc_as_json.dict /tmp/pg_proc_as_json/*.raw

Then the sizes were measured again, the same way as above.

Here are the results:

              lz4     zstd  uncompressed
no dict   1328794   982497       3898498
dict       375070   267194

I'd say the effect of the dictionary is pretty impressive. And remember,
this is with the dictionary having been trained on a subset of the data.
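
Relative to the 3898498 bytes of raw data, that's roughly 2.9x / 4.0x
compression without the dictionary, and roughly 10.4x / 14.6x with it, for
lz4 / zstd respectively.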

As a comparison, here's all the data compressed at once:

              lz4     zstd
no dict    180231   104913
dict       179814   106444

Unsurprisingly, the dictionary doesn't help much here, because the
compression algorithm can "natively" see the duplication across rows.
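
That's roughly 22x (lz4) and 37x (zstd) over the 3898498 bytes of raw data,
with or without the dictionary.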

- Andres
