Re: [PATCH] Compression dictionaries for JSONB

From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Nikita Malakhov <hukutoc(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Jacob Champion <jchampion(at)timescale(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-04-18 17:21:14
Message-ID: CAJ7c6TPN3Vww95YHMrgMjyHoRuz7GpDotyoD3kPpVxt900VLUA@mail.gmail.com

Matthias, Nikita,

Many thanks for the feedback!

> Any type with typlen < 0 should work, right?

Right.

> The use of dictionaries should be dependent on only the use of a
> compression method that supports pre-computed compression
> dictionaries. I think storage=MAIN + compression dictionaries should
> be supported, to make sure there is no expensive TOAST lookup for the
> attributes of the tuple; but that doesn't seem to be an option with
> that design.

> I don't think it's a good idea to interfere with the storage strategies. Dictionary
> should be a kind of storage option, like a compression, but not the strategy
> declining all others.

My reasoning behind this proposal was as follows.

Let's not forget that MAIN attributes *can* be stored in a TOAST table
as a last resort, and also that EXTENDED attributes are compressed
in-place first, and are stored in a TOAST table *only* if this is
needed to fit a tuple in toast_tuple_target bytes (which the user can
additionally change). So whether in practice it's going to be
advantageous to distinguish MAIN+dict.compressed and
EXTENDED+dict.compressed attributes seems to be debatable.

Basically the only difference between MAIN and EXTENDED is the
priority the four-stage TOASTing algorithm gives to the corresponding
attributes. I would assume if the user wants dictionary compression,
the attribute should be highly compressible and thus always EXTENDED.
(We seem to use MAIN for types that are not that well compressible.)
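
To make the priority argument concrete, here is a rough model of the
four passes, as a sketch only. It is not the actual heap TOAST code:
the Attr class, the toast() function and the compressed_size field are
made up for illustration, and the real algorithm can also move a very
large value out of line during the first pass. The 18-byte pointer
size is the documented size of a TOAST pointer datum.

```python
from dataclasses import dataclass

POINTER_SIZE = 18  # size of a TOAST pointer datum left in the tuple


@dataclass
class Attr:
    name: str
    storage: str          # "EXTENDED", "EXTERNAL", "MAIN" or "PLAIN"
    size: int             # current inline size in bytes
    compressed_size: int  # assumed size after compression (hypothetical input)
    compressed: bool = False
    external: bool = False


def toast(attrs, target):
    """Simplified model: run the four passes until the tuple fits in `target`."""

    def total():
        return sum(a.size for a in attrs)

    def largest(storages, want_uncompressed):
        # Largest attribute that is still inline and matches the pass.
        cands = [a for a in attrs
                 if a.storage in storages and not a.external
                 and not (want_uncompressed and a.compressed)]
        return max(cands, key=lambda a: a.size, default=None)

    def compress(a):
        a.compressed = True
        a.size = min(a.size, a.compressed_size)

    def externalize(a):
        a.external = True
        a.size = POINTER_SIZE

    passes = [
        ({"EXTENDED"}, True, compress),               # 1: compress EXTENDED
        ({"EXTENDED", "EXTERNAL"}, False, externalize),  # 2: move them out of line
        ({"MAIN"}, True, compress),                   # 3: compress MAIN
        ({"MAIN"}, False, externalize),               # 4: MAIN out of line, last resort
    ]
    for storages, want_uncompressed, action in passes:
        while total() > target:
            a = largest(storages, want_uncompressed)
            if a is None:
                break
            action(a)
    return attrs


# Two equally sized, equally compressible attributes that differ only in storage.
result = toast([Attr("doc_ext", "EXTENDED", 3000, 1500),
                Attr("doc_main", "MAIN", 3000, 1500)], target=2000)
for a in result:
    print(a.name, "compressed" if a.compressed else "raw",
          "external" if a.external else "inline", a.size)
```

In this model the EXTENDED attribute ends up compressed and out of
line, while the MAIN one is compressed and stays inline. Both get
compressed either way; the storage strategy only decides who is
evicted first.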

This being said, if the majority believes we should introduce a new
entity and keep storage strategies as is, I'm fine with that. This
perhaps is not going to be the most convenient interface for the user.
On the flip side it's going to be flexible. It's all about compromise.

> I think "AT_AC SET COMPRESSION lz4 {[WITH | WITHOUT] DICTIONARY}",
> "AT_AC SET COMPRESSION lz4-dictionary", or "AT_AC SET
> compression_dictionary = on" would be better from a design
> perspective.

> Agree with Matthias on above.

OK, unless someone objects, we have a consensus here.

> Didn't we get zstd support recently as well?

Unfortunately, it is not used for TOAST. In fact I vaguely recall that
ZSTD support for TOAST may have been explicitly rejected. Don't quote
me on that however...

I think it's going to be awkward to support PGLZ/LZ4 for COMPRESSION
and LZ4/ZSTD for dictionary compression. As a user personally I would
prefer having one set of compression algorithms that can be used with
TOAST.

Perhaps for the PoC we could focus on LZ4, and maybe PGLZ, if we choose
to use PGLZ for compression dictionaries too. We can always discuss
ZSTD separately.
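
For anyone following the thread who hasn't used pre-computed
dictionaries before, here is a minimal illustration of the idea. It
uses zlib purely as a stand-in because it ships with Python; it says
nothing about how LZ4 or PGLZ would be wired into TOAST. The
dictionary contents and the helper names are made up for the example.

```python
import zlib

# Hypothetical dictionary: byte strings expected to recur in the documents.
DICTIONARY = b'"user_id": "status": "active" "created_at": "'


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    # The compressor is primed with the dictionary before seeing any data.
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    # The very same dictionary must be supplied again to decompress.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


doc = b'{"user_id": 42, "status": "active", "created_at": "2023-04-18"}'
blob = compress_with_dict(doc, DICTIONARY)
print(len(doc), len(zlib.compress(doc)), len(blob))
assert decompress_with_dict(blob, DICTIONARY) == doc
```

The point relevant to this thread is the second helper: a value
compressed with a dictionary can only be read back if the reader can
find exactly that dictionary, so the stored datum has to identify it
somehow.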

> Can we specify a default compression method for each postgresql type,
> just like how we specify the default storage? If not, then the setting
> could realistically be in conflict with a default_toast_compression
> setting, assuming that dictionary support is not a requirement for
> column compression methods.

No, only STORAGE can be specified [1].

> The toast pointer must store enough info about the compression used to
> decompress the datum, which implies it needs to store the compression
> algorithm used, and a reference to the compression dictionary (if
> any). I think the idea about introducing a new toast pointer type (in
> the custom toast patch) wasn't bad per se, and that change would allow
> us to carry more or different info in the header.

> The Pluggable TOAST was rejected, but we have a lot of improvements
> based on changing the TOAST pointer structure.

Interestingly it looks like we ended up working on TOAST improvement
after all. I'm almost certain that we will have to modify TOAST
pointers to a certain degree in order to make it work. Hopefully it's
not going to be too invasive.
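
To show what kind of change I have in mind, here is a sketch of the
current pointer payload next to a possible extended one. The first
four fields follow the existing varatt_external layout (va_rawsize,
va_extinfo, va_valueid, va_toastrelid), with the compression method in
the top two bits of va_extinfo. The trailing dictionary id is purely
hypothetical, as are the function names; nothing here is taken from a
patch. Little-endian packing is chosen only to keep the example
deterministic.

```python
import struct

EXTSIZE_BITS = 30
EXTSIZE_MASK = (1 << EXTSIZE_BITS) - 1
PGLZ_ID, LZ4_ID = 0, 1  # existing compression method ids


def pack_pointer(rawsize, extsize, method, valueid, toastrelid, dictid=None):
    # va_extinfo: low 30 bits = external size, high 2 bits = compression method
    extinfo = (method << EXTSIZE_BITS) | (extsize & EXTSIZE_MASK)
    payload = struct.pack("<iIII", rawsize, extinfo, valueid, toastrelid)
    if dictid is not None:
        # HYPOTHETICAL extra field: identifier of the dictionary used
        payload += struct.pack("<I", dictid)
    return payload


def unpack_pointer(payload):
    rawsize, extinfo, valueid, toastrelid = struct.unpack_from("<iIII", payload)
    # A longer payload is assumed to carry the hypothetical dictionary id.
    dictid = struct.unpack_from("<I", payload, 16)[0] if len(payload) > 16 else None
    return {
        "rawsize": rawsize,
        "extsize": extinfo & EXTSIZE_MASK,
        "method": extinfo >> EXTSIZE_BITS,
        "valueid": valueid,
        "toastrelid": toastrelid,
        "dictid": dictid,
    }


plain = pack_pointer(5000, 1800, LZ4_ID, 16401, 16390)
with_dict = pack_pointer(5000, 900, LZ4_ID, 16402, 16390, dictid=24576)
print(len(plain), len(with_dict))
print(unpack_pointer(with_dict))
```

Since only two bits are available for the method and there is no spare
room for a dictionary reference, carrying one means either a larger
pointer, as above, or a new pointer type.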

[1]: https://www.postgresql.org/docs/current/sql-createtype.html
--
Best regards,
Aleksander Alekseev
