Re: [PATCH] Compression dictionaries for JSONB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nikita Malakhov <hukutoc(at)gmail(dot)com>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Jacob Champion <jchampion(at)timescale(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-02-10 20:01:45
Message-ID: 20230210200145.sm3bt5xidlqx6ukt@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-02-10 21:22:14 +0300, Nikita Malakhov wrote:
> If I understand Andres' message correctly - the proposition is to
> make the use of compression dictionaries automatic, possibly by just
> setting a parameter when the table is created, something like
> CREATE TABLE t ( ..., t JSONB USE DICTIONARY);

I definitely wouldn't make it automatic initially; start with explicit
dictionary creation, see how well that works, and consider automating it
later.

Whether automatic or not, it probably makes sense to integrate building
dictionaries with ANALYZE. We can build dictionaries from sampled datums,
check how efficient a dictionary is when compressing sampled datums, and
compare the efficiency of a new dictionary with the existing one to see
whether a new one is worth it.
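
Roughly, and purely as a sketch of the analyze-side mechanics (this uses
libzstd's dictionary API; the helper names and the "is it worth it" check
below are made up for illustration, not anything in the patch):

    #include <stdlib.h>
    #include <zstd.h>
    #include <zdict.h>

    /*
     * Total compressed size of the concatenated samples, with or without
     * a dictionary (dict == NULL means plain compression).  The same
     * comparison could be run against the currently installed dictionary.
     */
    static size_t
    compressed_total(const char *samples, const size_t *sizes,
                     unsigned nsamples, const void *dict, size_t dict_size)
    {
        ZSTD_CCtx  *cctx = ZSTD_createCCtx();
        size_t      total = 0;
        size_t      off = 0;

        for (unsigned i = 0; i < nsamples; i++)
        {
            size_t  bound = ZSTD_compressBound(sizes[i]);
            char   *dst = malloc(bound);
            size_t  csize;

            if (dict != NULL)
                csize = ZSTD_compress_usingDict(cctx, dst, bound,
                                                samples + off, sizes[i],
                                                dict, dict_size, 3);
            else
                csize = ZSTD_compressCCtx(cctx, dst, bound,
                                          samples + off, sizes[i], 3);

            if (!ZSTD_isError(csize))
                total += csize;

            free(dst);
            off += sizes[i];
        }

        ZSTD_freeCCtx(cctx);
        return total;
    }

    /*
     * Train a dictionary from the sampled datums; return it only if it
     * actually compresses the samples better than no dictionary at all.
     */
    static void *
    maybe_build_dictionary(const char *samples, const size_t *sizes,
                           unsigned nsamples, size_t *dict_size)
    {
        size_t  dict_cap = 64 * 1024;   /* keep the dictionary bounded */
        void   *dict = malloc(dict_cap);
        size_t  size;

        size = ZDICT_trainFromBuffer(dict, dict_cap,
                                     samples, sizes, nsamples);
        if (ZDICT_isError(size) ||
            compressed_total(samples, sizes, nsamples, dict, size) >=
            compressed_total(samples, sizes, nsamples, NULL, 0))
        {
            free(dict);
            return NULL;
        }

        *dict_size = size;
        return dict;
    }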

> The question is then how to create such dictionaries automatically
> and extend them while data is being added to the table. Because
> it is not unusual that after a time circumstances change and a
> rather small table starts to be loaded with huge amounts of data.

It doesn't really make sense to create dictionaries for small tables
anyway. For a dictionary to be efficient, you need a reasonable amount of
data to build it from.

> I prefer extending a dictionary over re-creating it because while
> a dictionary is recreated we leave users two choices - to wait until
> dictionary creation is over or to use the old version (say, kept
> as a snapshot while a new one is created). Keeping many versions
> simultaneously does not make sense and would increase DB size.

I don't think you really can extend dictionaries. The references into the
dictionary are as small as possible, based on the contents of the
dictionary. And you normally want to keep the dictionary size bounded for
that reason alone, but also for [de]compression speed.

So you'd need to build a new dictionary, and use that going forward. And
yes, you'd not be able to delete the old dictionary, because there will
still be references to the old one.

We could add a command to scan the data to see if an old dictionary is
still used, or even to remove references to it. But I don't think that's
particularly important: It only makes sense to create the initial
dictionary once some data has accumulated, and creating further
dictionaries only makes sense once the table is a good bit larger. At that
point the size of another dictionary to be stored isn't relevant relative
to the table size (otherwise you'd just give up after building a new
dictionary and evaluating its effectiveness).
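
If the compressed datums end up as zstd frames, such a scan could lean on
the dictionary ID that zstd records in each frame header. A rough sketch
(how the datums themselves are collected is hand-waved away here):

    #include <zstd.h>

    /*
     * Count how many compressed datums still reference a given
     * dictionary, using the dictionary ID zstd stores in each frame
     * header.  The caller is assumed to have gathered the compressed
     * datums from the table somehow.
     */
    static unsigned
    count_dict_references(const void *dict, size_t dict_size,
                          const void **datums, const size_t *sizes,
                          unsigned ndatums)
    {
        unsigned    target = ZSTD_getDictID_fromDict(dict, dict_size);
        unsigned    refs = 0;

        for (unsigned i = 0; i < ndatums; i++)
        {
            /* 0 means "no dictionary" (or the frame doesn't say) */
            if (ZSTD_getDictID_fromFrame(datums[i], sizes[i]) == target)
                refs++;
        }
        return refs;
    }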

> Also, compressing small data with a large dictionary (the case of one
> dictionary for many tables), I think, would add considerable
> overhead to the INSERT/UPDATE commands, so the most reasonable
> choice is a per-table dictionary.

Likely even per-column, but I can see some advantages in either approach.

> Any ideas on how to create and extend such dictionaries automatically?

As I said, I don't think we should extend dictionaries. For this to work
we'll likely need a new / extended compressed toast datum header of some
form, with a reference to the dictionary. That'd likely be needed even
with updatable dictionaries, as IIRC we don't know which column a toasted
datum is for, but we need to know in order to identify the dictionary. As
we need that field anyway, we can easily have multiple dictionaries.
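
FWIW, today's compressed-datum header is just the varlena length word plus
32 bits of rawsize/method info (toast_compress_header). An extended
variant carrying a dictionary reference could look roughly like the
following; the name and layout are purely illustrative, not a concrete
on-disk proposal:

    /*
     * Illustrative only: a compressed-datum header that also identifies
     * the dictionary used.  The first two fields mirror the existing
     * toast_compress_header; the dictionary field is made up.
     */
    typedef struct toast_compress_header_dict
    {
        int32   vl_len_;    /* varlena header word, as today */
        uint32  tcinfo;     /* rawsize + compression method, as today */
        Oid     dictid;     /* dictionary this datum was compressed with */
    } toast_compress_header_dict;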

Greetings,

Andres Freund
