Re: [PATCH] Compression dictionaries for JSONB

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Jacob Champion <jchampion(at)timescale(dot)com>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-02-05 17:05:51
Message-ID: CAJ7c6TNgq3O9SVXcpUXs0gVuBzfD_22SGZmCKUC4dj84nc8j7w@mail.gmail.com
Lists: pgsql-hackers

Hi,

> I assume that manually specifying dictionary entries is a consequence of
> the prototype state? I don't think this is something humans are very
> good at, just analyzing the data to see what's useful to dictionarize
> seems more promising.

No, humans are not good at it. The idea was to automate the process,
e.g. by building the dictionaries during VACUUM.
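
To illustrate, a dictionary could be trained from values sampled during
VACUUM with zstd's ZDICT API. A rough sketch (ZDICT_trainFromBuffer()
and ZDICT_isError() are zstd's actual API from zdict.h; the surrounding
function and its names are hypothetical):

    /* Train a compression dictionary from sampled JSONB values,
     * e.g. collected while VACUUM scans the table. */
    #include <zdict.h>

    static size_t
    train_dict_from_samples(void *dict_buf, size_t dict_capacity,
                            const void *samples,    /* concatenated values */
                            const size_t *sample_sizes,
                            unsigned nsamples)
    {
        size_t  dict_size = ZDICT_trainFromBuffer(dict_buf, dict_capacity,
                                                  samples, sample_sizes,
                                                  nsamples);

        if (ZDICT_isError(dict_size))
            return 0;           /* too few samples etc., skip this run */
        return dict_size;       /* caller persists the dictionary */
    }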

> I don't think we'd want much of the infrastructure introduced in the
> patch for type agnostic cross-row compression. A dedicated "dictionary"
> type as a wrapper around other types IMO is the wrong direction. This
> should be a relation-level optimization option, possibly automatic, not
> something visible to every user of the table.

So to clarify, are we talking about tuple-level compression? Or
perhaps page-level compression?

Implementing page-level compression should be *relatively*
straightforward. As an example, this was previously done in InnoDB.
Basically, InnoDB compresses the entire page, rounds the result up to
1K, 2K, 4K, 8K, etc., and stores it in a corresponding fork ("fork" in
PG terminology), similar to how a slab allocator works. Additionally, a
page_id -> fork_id map has to be maintained, probably in yet another
fork, similar to the visibility map. A compressed page may move to a
different fork after being modified, since modification can change its
compressed size. The buffer manager is unaffected and deals only with
uncompressed pages. (I'm not an expert in InnoDB and this is my very
rough understanding of how its compression works.)
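
The size-class selection that scheme implies is simple; a minimal
sketch (all names hypothetical, not actual InnoDB or PostgreSQL code):

    /* Round a compressed page up to the next power-of-two size class;
     * each class would be stored in its own fork. */
    #define MIN_CLASS   1024        /* 1K, the smallest class */
    #define MAX_CLASS   8192        /* 8K, i.e. effectively uncompressed */

    static size_t
    size_class_for(size_t compressed_len)
    {
        size_t  cls = MIN_CLASS;

        while (cls < compressed_len && cls < MAX_CLASS)
            cls <<= 1;              /* 1K -> 2K -> 4K -> 8K */
        return cls;                 /* identifies the fork to store into */
    }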

I believe this can be implemented as a TAM (table access method).
Whether this counts as "dictionary" compression is debatable, but it
gives users similar benefits, give or take. The advantage is that users
wouldn't have to define any dictionaries manually, nor would the DBMS
have to build them during VACUUM or otherwise.
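
The boilerplate such a TAM would need looks roughly like this (the
handler pattern is the real table AM API, see tableam.h; the callbacks,
where all the actual work would happen, are omitted, and the names are
made up):

    #include "postgres.h"
    #include "access/tableam.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    /* the ~40 callbacks (scan_begin, tuple_insert, ...) go here */
    static const TableAmRoutine compressed_heap_methods = {
        .type = T_TableAmRoutine,
    };

    PG_FUNCTION_INFO_V1(compressed_heap_handler);

    Datum
    compressed_heap_handler(PG_FUNCTION_ARGS)
    {
        PG_RETURN_POINTER(&compressed_heap_methods);
    }

    /* SQL side: CREATE ACCESS METHOD compressed_heap TYPE TABLE
     *           HANDLER compressed_heap_handler; */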

> I also suspect that we'd have to spend a lot of effort to make
> compression/decompression fast if we want to handle dictionaries
> ourselves, rather than using the dictionary support in libraries like
> lz4/zstd.

That's a reasonable concern, can't argue with that.
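
For reference, delegating to zstd's dictionary support is only a couple
of calls (a minimal sketch; the zstd API is real, the wrapper is
hypothetical and error handling is simplified):

    #include <zstd.h>

    static size_t
    compress_with_dict(void *dst, size_t dst_cap,
                       const void *src, size_t src_len,
                       const void *dict, size_t dict_len)
    {
        ZSTD_CCtx  *cctx = ZSTD_createCCtx();
        size_t      n;

        n = ZSTD_compress_usingDict(cctx, dst, dst_cap, src, src_len,
                                    dict, dict_len, ZSTD_CLEVEL_DEFAULT);
        ZSTD_freeCCtx(cctx);
        return n;               /* check with ZSTD_isError() */
    }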

> I don't think a prototype-y patch not needing a rebase for two months
> is a good measure of complexity :)

It's worth noting that I also invested quite some time into reviewing
type-aware TOASTers :) I just chose to keep my personal opinion about
the complexity of that patch to myself this time, since obviously I'm a
bit biased. However, if you are curious, it's all in the corresponding
thread.

--
Best regards,
Aleksander Alekseev
