Re: RFC: compression dictionaries for JSONB

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: RFC: compression dictionaries for JSONB
Date: 2021-10-11 17:39:02
Message-ID: CAEze2Wiev85+Oj7a0jXNSkBpTeocMx3bzoO7HMF6fUkG6FOAwA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 11 Oct 2021 at 15:25, Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
> Agree, add / change / remove of a column should be handled
> automatically. Just to clarify, by column option do you mean syntax
> like ALTER TABLE ... ALTER COLUMN ... etc, right?

Correct, either SET (option) or maybe using typmod (which is hack-ish,
but could save on some bytes of storage)

> I didn't think of
> extending this part of the syntax. That would be a better choice
> indeed.

> > Overall, I'm glad to see this take off, but I do want some
> > clarifications regarding the direction that this is going towards.
> > [...]
> > Actually, why is it a JSONB_DICTIONARY and not like:
> > CREATE TYPE name AS DICTIONARY (
> > base_type = JSONB, ...
> > );
> > so that it is possible to use the infrastructure for other things? For
> > example, perhaps PostGIS geometries could benefit from it -- or even
> > text or xml columns.
>
> So the question is if we want to extend the capabilities of a single
> type, i.e. JSONB, or to add a functionality that would work for the
> various types. I see the following pros and cons of both approaches.
>
> Modifying JSONB may at some point allow to partially decompress only
> the parts of the document that need to be decompressed for a given
> query. However, the compression schema will be probably less
> efficient. There could also be difficulties in respect of backward
> compatibility, and this is going to work only with JSONB.

Assuming this above is option 1. If I understand correctly, this
option was 'adapt the data type so that it understands how to handle a
shared dictionary, decreasing storage requirements'.

> An alternative approach, CREATE TYPE ... AS DICTIONARY OF <type> or
> something like this would work not only for JSONB, but also for TEXT,
> XML, arrays, and PostGIS. By the way, this was suggested before [1].
> Another advantage here is that all things being equal the compression
> schema could be more efficient. The feature will not affect existing
> types. The main disadvantage is that implementing a partial
> decompression would be very difficult and/or impractical.

Assuming this was the 2nd option. If I understand correctly, this
option is effectively 'adapt or wrap TOAST to understand and handle
dictionaries for dictionary encoding common values'.

> Personally, I would say that the 2nd option, CREATE TYPE ... AS
> DICTIONARY OF <type>, seems to be more useful. To my knowledge, not
> many users would care much about partial decompression, and this is
> the only real advantage of the 1st option I see. Also, this is how
> ZSON is implemented. It doesn't care about the underlying type and
> treats it as a BLOB. Thus the proofs of usefulness I provided above
> are not quite valid for the 1st option. Probably unrelated, but 2nd
> option would be even easier for me to implement since I already solved
> a similar task.
>
> All in all, I suggest focusing on the 2nd option with universal
> compression dictionaries. Naturally, the focus will be on JSONB first.
> But we will be able to extend this functionality for other types as
> well.
>
> Thoughts?

I think that an 'universal dictionary encoder' would be useful, but
that a data type might also have good reason to implement their
replacement methods by themselves for better overall performance (such
as maintaining partial detoast support in dictionaried items, or
overall lower memory footprint, or ...). As such, I'd really
appreciate it if Option 1 is not ruled out by any implementation of
Option 2.

Kind regards,

Matthias van de Meent

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2021-10-11 17:46:16 Re: BUG #17212: pg_amcheck fails on checking temporary relations
Previous Message Stephen Frost 2021-10-11 17:30:38 Re: storing an explicit nonce