Re: [PATCH] Compression dictionaries for JSONB

From: Jacob Champion <jchampion(at)timescale(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2022-06-02 22:21:26
Message-ID: CAAWbhmhzjmyg+v_2OQao+116VHSicg_6+2k+MHSbJDV7cHMcjA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 2, 2022 at 6:30 AM Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
> > I saw there was some previous discussion about dictionary size. It
> > looks like this approach would put all dictionaries into a shared OID
> > pool. Since I don't know what a "standard" use case is, is there any
> > risk of OID exhaustion for larger deployments with many dictionaries?
> > Or is 2**32 so comparatively large that it's not really a serious
> > concern?
>
> I agree, this is a drawback of the current implementation. To be honest,
> I simply followed the example of how ENUMs are implemented. I'm not 100% sure
> if we should be worried here (apparently, freed OIDs are reused). I'm OK with
> using a separate sequence if someone could second this. This is the first time
> I'm altering the catalog so I'm not certain what the best practices are.

I think reuse should be fine (if a bit slower, but offhand that
doesn't seem like an important bottleneck). Users may be unamused to
find that one large dictionary has prevented the creation of any new
entries in other dictionaries, though. But again, I have no intuition
for the size of a production-grade compression dictionary, and maybe
it's silly to assume that normal use would ever reach the OID limit.
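
Back of the envelope, with invented numbers: the OID counter is
32-bit, so roughly 4.2 billion values shared cluster-wide. Even at
10,000 entries per dictionary you'd need on the order of hundreds of
thousands of dictionaries before the pool itself ran dry, so the more
realistic annoyance is probably reuse churn rather than true
exhaustion.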

> > I see the algorithm description, but I'm curious to know whether it's
> > based on some other existing compression scheme, for the sake of
> > comparison. It seems like it shares similarities with the Snappy
> > scheme?
> >
> > Could you talk more about what the expected ratios and runtime
> > characteristics are? Best I can see is that compression runtime is
> > something like O(n * e * log d) where n is the length of the input, e
> > is the maximum length of a dictionary entry, and d is the number of
> > entries in the dictionary. Since e and d are constant for a given
> > static dictionary, how well the dictionary is constructed is
> > presumably important.
>
> The algorithm is almost identical to the one I used in the ZSON extension [1],
> except that ZSON uses 16-bit codes. In docs/benchmark.md you will find the
> approximate ratios to expect, etc.

That's assuming a machine-trained dictionary, though, which isn't part
of the proposal now. Is there a performance/ratio sample for a "best
practice" hand-written dictionary?

> > That reminds me. How do people expect to generate a "good" dictionary
> > in practice? Would they somehow get the JSONB representations out of
> > Postgres and run a training program over the blobs? I see some
> > reference to training functions in the prior threads but don't see any
> > breadcrumbs in the code.
>
> So far we agreed that in the first implementation it will be done manually.
> In the future it will be possible to update the dictionaries automatically
> during VACUUM. The idea of something similar to zson_learn() procedure, as
> I recall, didn't get much support, so we probably will not have it, or at least
> it is not a priority.

Hm... I'm skeptical that a manually-constructed set of compression
dictionaries would be maintainable over time or at scale. But I'm not
the target audience so I will let others weigh in here instead.
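
(To make "manually" concrete, I assume the workflow is something like
running a frequency query over the existing data -- table and column
names invented:

    SELECT k, count(*)
      FROM my_table, LATERAL jsonb_object_keys(payload) AS k
     GROUP BY k
     ORDER BY count(*) DESC
     LIMIT 100;

and then pasting the survivors into CREATE TYPE ... AS DICTIONARY OF.
Re-running that by hand every time the document shape drifts is the
part that worries me.)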

> > > - Alternative compression algorithms. Note that this will not require any
> > > further changes in the catalog, only the values we write to pg_type and
> > > pg_cast will differ.
> >
> > Could you expand on this? I.e. why would alternative algorithms not
> > need catalog changes? It seems like the only schemes that could be
> > used with pg_catalog.pg_dict are those that expect to map a byte
> > string to a number. Is that general enough to cover other standard
> > compression algorithms?
>
> Sure. When creating a new dictionary pg_type and pg_cast are modified like this:
>
> =# CREATE TYPE mydict AS DICTIONARY OF JSONB ('abcdef', 'ghijkl');
> CREATE TYPE
> =# SELECT * FROM pg_type WHERE typname = 'mydict';
> -[ RECORD 1 ]--+---------------
> oid | 16397
> typname | mydict
> typnamespace | 2200
> ...
> typarray | 16396
> typinput | dictionary_in
> typoutput | dictionary_out
> ...
>
> =# SELECT c.*, p.proname FROM pg_cast AS c
> LEFT JOIN pg_proc AS p
> ON p.oid = c.castfunc
> WHERE c.castsource = 16397 or c.casttarget = 16397;
> -[ RECORD 1 ]-----------------
> oid | 16400
> castsource | 3802
> casttarget | 16397
> castfunc | 9866
> castcontext | a
> castmethod | f
> proname | jsonb_dictionary
> -[ RECORD 2 ]-----------------
> oid | 16401
> castsource | 16397
> casttarget | 3802
> castfunc | 9867
> castcontext | i
> castmethod | f
> proname | dictionary_jsonb
> -[ RECORD 3 ]-----------------
> oid | 16402
> castsource | 16397
> casttarget | 17
> castfunc | 9868
> castcontext | e
> castmethod | f
> proname | dictionary_bytea
>
> In order to add a new algorithm you simply need to provide alternatives
> to dictionary_in / dictionary_out / jsonb_dictionary / dictionary_jsonb and
> specify them in the catalog instead. The catalog schema will remain the same.

The catalog schemas for pg_type and pg_cast would. But would the
current pg_dict schema be generally applicable to other cross-table
compression schemes? It seems narrowly tailored -- which is not a
problem for a proof of concept patch; I'm just not seeing how other
standard compression schemes might make use of an OID-to-NameData map.
My naive understanding is that they have their own dictionary
structures.

(You could of course hack in any general structure you needed by
treating pg_dict like a list of chunks, but that seems wasteful and
slow, especially given the 63-byte chunk limit, and even more likely
to exhaust the shared OID pool. I think LZMA dictionaries can be huge,
as one example.)

> > It also seems strange to use a dictionary of C strings to compress
> > binary data; wouldn't we want to be able to compress zero bytes too?
>
> That's a good point. Again, here I simply followed the example of the ENUMs
> implementation. Since compression dictionaries are intended to be used with
> text-like types such as JSONB (and also JSON, TEXT and XML in the future),
> choosing Name type seemed to be a reasonable compromise. Dictionary entries are
> most likely going to store JSON keys, common words used in the TEXT, etc.
> However, I'm fine with any alternative scheme if somebody experienced with the
> PostgreSQL catalog could second this.

I think Matthias back in the first thread was hoping for the ability
to compress duplicated JSON objects as well; it seems like that
wouldn't be possible with the current scheme. (Again I have no
intuition for which use cases are must-haves.) I'm wondering if
pg_largeobject would be an alternative catalog to draw inspiration
from... specifically the use of bytea as the stored value, and of a
two-column primary key.
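
Concretely, I'm imagining the entry table would look roughly like
pg_largeobject does -- just a sketch, column names invented:

    dictid   oid      -- which dictionary the entry belongs to
    entrynum int4     -- the code assigned within that dictionary
    entry    bytea    -- arbitrary bytes, embedded zeros allowed
                      -- (unique on (dictid, entrynum))

which would lift both the 63-byte limit and the NUL restriction.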

But take all my suggestions with a dash of salt :D I'm new to this space.

Thanks!
--Jacob
