From: Nikita Malakhov <hukutoc(at)gmail(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Simon Riggs <simon(dot)riggs(at)enterprisedb(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Zhihong Yu <zyu(at)yugabyte(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2022-08-19 07:57:42
Message-ID: CAN-LCVOn68NnZ-CUc56XfmS+HHK_PoOS3y1RsLbnMhASU3HMbg@mail.gmail.com
Lists: pgsql-hackers

Hi hackers!

I've got a question, and partly a proposal, for the future development of this
feature.

What if we used the pg_dict table not to store the dictionaries themselves,
but only their metadata, with each actual dictionary stored in its own
separate table, the way it is done with TOAST tables
(i.e. pg_dict.<dictionary 1 entry> --> pg_dict_16385)?
Thus we could kill several birds with one stone:

- concurrent dictionary updates, which look like a very serious issue right
  now, would no longer affect each other or overall DB performance;
- we would get around the SQL statement size restriction, and the dictionary
  size restriction as well;
- we could deal effectively with versions of entries within a dictionary, and
  even with versions of whole dictionaries;
- we could use it for deduplicating repeated JSON parts;
- we could provide an API for working with dictionaries and dictionary
  tables, which later could be useful for working with JSON schemas as well
  (maybe with some extension).
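
For reference, this is roughly the TOAST indirection the proposal mimics at
the C level (a simplified sketch; the function and its name are invented here
just to show the pattern, the real code in the toasting paths does much more):

#include "postgres.h"
#include "access/table.h"
#include "utils/rel.h"

static void
dict_toast_analogy(Relation rel)
{
    /* pg_class.reltoastrelid points at the backing pg_toast_NNNNN table */
    Relation    toastrel = table_open(rel->rd_rel->reltoastrelid,
                                      RowExclusiveLock);

    /* ... store or fetch oversized values in toastrel here ... */

    table_close(toastrel, RowExclusiveLock);
}

Each dictionary would get the same kind of backing relation, reachable
through its meta entry in pg_dict.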

Overall structure could look like this:
pg_dict
|
|---- dictionary 1 meta
|        |-- name
|        |-- size
|        |-- etc.
|        |-- dictionary table name (i.e. pg_dict_16385)
|        |
|        |----> pg_dict_16385
|
|---- dictionary 2 meta
|        |-- name
|        |-- size
|        |-- etc.
|        |-- dictionary table name (i.e. pg_dict_16386)
|        |
|        |----> pg_dict_16386
...

where each dictionary table could look like:

pg_dict_16385
|
|---- key 1
|        |-- value
|
|---- key 2
|        |-- value
...
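
To make that concrete, the meta catalog could be declared following the usual
src/include/catalog conventions. A minimal sketch -- the column names and the
catalog OID below are invented for illustration, nothing here is from the
actual patch:

/* Hypothetical src/include/catalog/pg_dict.h */
CATALOG(pg_dict,8423,DictRelationId)
{
    Oid         oid;            /* dictionary OID */
    NameData    dictname;       /* dictionary name */
    Oid         dictrelid;      /* OID of the backing table, e.g. pg_dict_16385 */
    int32       dictversion;    /* current version of the dictionary */
    int32       dictnentries;   /* number of key/value entries (size) */
} FormData_pg_dict;

typedef FormData_pg_dict *Form_pg_dict;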

And with a special DICT API we would have the means to access, cache, and
store our dictionaries in a uniform way from different levels. Implemented
like this, it also looks like a very valuable addition to our JSONb Toaster.
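
Such an API might expose something along these lines (a sketch only, to make
the shape of the idea concrete -- every name below is hypothetical):

/* Hypothetical DICT API, for illustration only */
extern Oid    DictCreate(const char *dictname);
extern int32  DictAddEntry(Oid dictid, bytea *value);   /* returns new entry ID */
extern bytea *DictGetEntry(Oid dictid, int32 entryid);
extern void   DictCacheInvalidate(Oid dictid);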

JSON schema processing is a very promising feature, and we have to keep up
with major competitors like Oracle, which is already working on it.

On Mon, Aug 1, 2022 at 2:25 PM Aleksander Alekseev <aleksander(at)timescale(dot)com> wrote:

> Hi hackers,
>
> > So far we seem to have a consensus to:
> >
> > 1. Use bytea instead of NameData to store dictionary entries;
> >
> > 2. Assign monotonically ascending IDs to the entries instead of using
> > Oids, as it is done with pg_class.relnatts. In order to do this we
> > should either add a corresponding column to pg_type, or add a new
> > catalog table, e.g. pg_dict_meta. Personally I don't have a strong
> > opinion on what is better. Thoughts?
> >
> > Both changes should be straightforward to implement and also are a
> > good exercise to newcomers.
> >
> > I invite anyone interested to join this effort as a co-author! (since,
> > honestly, rewriting the same feature over and over again alone is
> > quite boring :D).
>
> cfbot complained that v5 doesn't apply anymore. Here is the rebased
> version of the patch.
>
> > Good point. This was not a problem for ZSON since the dictionary size
> > was limited to 2**16 entries, the dictionary was immutable, and the
> > dictionaries had versions. For compression dictionaries we removed the
> > 2**16 entries limit and also decided to get rid of versions. The idea
> > was that you can simply continue adding new entries, but no one
> > thought about the fact that this will consume the memory required to
> > decompress the document indefinitely.
> >
> > Maybe we should return to the idea of limited dictionary size and
> > versions. Objections?
> > [ ...]
> > You are right. Another reason to return to the idea of dictionary
> versions.
>
> Since no one objected so far and/or proposed a better idea I assume
> this can be added to the list of TODOs as well.
>
> --
> Best regards,
> Aleksander Alekseev
>

--
Regards,
Nikita Malakhov
https://postgrespro.ru/
