Re: [PATCH] Compression dictionaries for JSONB

From: Jacob Champion <jchampion(at)timescale(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2022-06-01 21:29:06
Message-ID: CAAWbhmgtHaVdh4cDHcDWOd7m4KL6RfBcdtYgy5WDSKN=Sucb5w@mail.gmail.com

On Wed, Jun 1, 2022 at 1:44 PM Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
> This is a follow-up thread to `RFC: compression dictionaries for JSONB` [1]. I would like to share my current progress in order to get early feedback. The patch is currently in a draft state but implements the basic functionality. I did my best to account for all the great feedback I previously got from Alvaro and Matthias.

I'm coming up to speed with this set of threads -- the following is
not a complete review by any means, and please let me know if I've
missed some of the history.

> SELECT * FROM pg_dict;
>   oid  | dicttypid | dictentry
> -------+-----------+-----------
>  39476 |     39475 | aaa
>  39477 |     39475 | bbb
> (2 rows)

I saw there was some previous discussion about dictionary size. It
looks like this approach allocates an OID from the shared pool for
every entry of every dictionary. Since I don't know what a "standard"
use case is, is there any risk of OID exhaustion for larger
deployments with many (or very large) dictionaries? Or is 2**32 so
comparatively large that it's not really a serious concern?

> When `mydict` sees 'aaa' in the document, it replaces it with the corresponding code, in this case 39476. For more details regarding the compression algorithm and chosen compromises, please see the comments in the patch.

I see the algorithm description, but I'm curious to know whether it's
based on some other existing compression scheme, for the sake of
comparison. It seems like it shares similarities with the Snappy
scheme?

Could you talk more about what the expected ratios and runtime
characteristics are? Best I can see is that compression runtime is
something like O(n * e * log d) where n is the length of the input, e
is the maximum length of a dictionary entry, and d is the number of
entries in the dictionary. Since e and d are constant for a given
static dictionary, how well the dictionary is constructed is
presumably important.

> In pg_type `mydict` has typtype = TYPTYPE_DICT. It works the same way as TYPTYPE_BASE with only one difference: the corresponding `<type>_in` (pg_type.typinput) and `<another-type>_<type>` (pg_cast.castfunc) procedures receive the dictionary Oid as a `typmod` argument. This way the procedures can distinguish `mydict1` from `mydict2` and use the proper compression dictionary.
>
> The approach with an alternative `typmod` role is arguably a bit hacky, but it was the least invasive way to implement the feature I've found. I'm open to alternative suggestions.

Haven't looked at this closely enough to develop an opinion yet.

> Current limitations (todo):
> - ALTER TYPE is not implemented

That reminds me. How do people expect to generate a "good" dictionary
in practice? Would they somehow get the JSONB representations out of
Postgres and run a training program over the blobs? I see some
reference to training functions in the prior threads but don't see any
breadcrumbs in the code.
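
For what it's worth, when I try to picture a manual workflow it looks
something like the query below (purely illustrative; the table and
column names are mine and none of this comes from the patch):

    -- Hypothetical "training by hand": list the most frequent
    -- top-level keys of a JSONB column as candidate dictionary entries.
    SELECT key, count(*) AS freq
    FROM my_table, LATERAL jsonb_object_keys(my_jsonb_col) AS key
    GROUP BY key
    ORDER BY freq DESC
    LIMIT 100;

Is that roughly the intended workflow, or is the plan to eventually
ship a built-in training function?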

> - Alternative compression algorithms. Note that this will not require any further changes in the catalog, only the values we write to pg_type and pg_cast will differ.

Could you expand on this? I.e. why would alternative algorithms not
need catalog changes? It seems like the only schemes that could be
used with pg_catalog.pg_dict are those that expect to map a byte
string to a number. Is that general enough to cover other standard
compression algorithms?

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type that is used for enums. Are we OK with the accompanying limitations? Any alternative suggestions?

It does feel a little weird to have a hard limit on the entry length,
since that also limits the compression ratio. But it also limits the
compression runtime, so maybe it's a worthwhile tradeoff.
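
If the entries follow the usual name-type rules, that hard limit is
NAMEDATALEN - 1, i.e. 63 bytes with a default build. Casting anything
longer to name just truncates silently (generic name behavior, not
specific to the patch):

    SELECT length(repeat('x', 100)::name);  -- returns 63

So any candidate substring longer than that can't be represented at all.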

It also seems strange to use a dictionary of C strings to compress
binary data; wouldn't we want to be able to compress zero bytes too?

Hope this helps,
--Jacob
