Re: [PATCH] Compression dictionaries for JSONB

From: Nikita Malakhov <hukutoc(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Jacob Champion <jchampion(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date: 2023-02-07 06:11:52
Message-ID: CAN-LCVMg6ntnrjWFbHnuWEAMiJa_07+3bgHyaLApJu_igw9Y4w@mail.gmail.com
Lists: pgsql-hackers

Hi,

On updating dictionaries:

>You cannot "just" refresh a dictionary used once to compress an
>object, because you need it to decompress the object too.

and when you have many such objects, updating an existing dictionary
requires going through every object compressed with it across the whole
database. How to implement this feature correctly is a very tricky
question. There are also some thoughts on using JSON Schema to optimize
storage for JSON objects.
(The same applies to TOAST, so at first glance we decided to forbid
dropping or changing TOAST implementations already registered in a
particular database.)
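To make the refresh problem concrete, here is a toy sketch (illustrative only, not PostgreSQL code; the encoding scheme and all names are assumptions): once an object has been compressed against a particular dictionary, decompressing it with a "refreshed" dictionary that reassigns entries silently produces wrong data.

```python
# Toy key-dictionary codec: known keys are replaced by small integer ids,
# unknown keys are stored verbatim. Purely illustrative.

def compress(obj_keys, dictionary):
    """Encode a list of JSON keys using a key->id dictionary."""
    return [("id", dictionary[k]) if k in dictionary else ("raw", k)
            for k in obj_keys]

def decompress(encoded, dictionary):
    """Decode; requires the SAME dictionary the data was compressed with."""
    reverse = {v: k for k, v in dictionary.items()}
    return [reverse[i] if tag == "id" else i for tag, i in encoded]

dict_v1 = {"customer_id": 0, "order_date": 1}
stored = compress(["customer_id", "order_date", "note"], dict_v1)

# A "refreshed" dictionary that reassigns ids corrupts already-stored data:
dict_v2 = {"order_date": 0, "customer_id": 1}
assert decompress(stored, dict_v1) == ["customer_id", "order_date", "note"]
assert decompress(stored, dict_v2) == ["order_date", "customer_id", "note"]  # wrong!
```

So a dictionary can only ever be extended append-only (or versioned, with each datum recording which version it was compressed with); rewriting it in place means rewriting every dependent object.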

In my experience, even with fast SSD storage arrays, on a large database
(about 40-50 TB) disk access was the bottleneck more often than CPU,
except for cases with many parallel execution threads for a single
query (Oracle).

On Mon, Feb 6, 2023 at 10:33 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2023-02-06 16:16:41 +0100, Matthias van de Meent wrote:
> > On Mon, 6 Feb 2023 at 15:03, Aleksander Alekseev
> > <aleksander(at)timescale(dot)com> wrote:
> > >
> > > Hi,
> > >
> > > I see your point regarding the fact that creating dictionaries on a
> > > training set is too beneficial to neglect it. Can't argue with this.
> > >
> > > What puzzles me though is: what prevents us from doing this on a page
> > > level as suggested previously?
> >
> > The complexity of page-level compression is significant, as pages are
> > currently a base primitive of our persistency and consistency scheme.
>
> +many
>
> It's also not all a panacea performance-wise, datum-level decompression can
> often be deferred much longer than page level decompression. For things like
> json[b], you'd hopefully normally have some "pre-filtering" based on proper
> columns, before you need to dig into the json datum.
>
> It's also not necessarily that good, compression ratio wise. Particularly
> for wider datums you're not going to be able to remove much duplication,
> because there's only a handful of tuples. Consider the case of json keys -
> the dictionary will often do better than page level compression, because
> it'll have the common keys in the dictionary, which means the "full" keys
> never will have to appear on a page, whereas page-level compression will
> have the keys on it, at least once.
>
> Of course you can use a dictionary for page-level compression too, but the
> gains when it works well will often be limited, because in most OLTP usable
> page-compression schemes I'm aware of, you can't compress a page all that
> far down, because you need a small number of possible "compressed page
> sizes".
>
>
> > > More similar data you compress the more space and disk I/O you save.
> > > Additionally you don't have to compress/decompress the data every time
> > > you access it. Everything that's in shared buffers is uncompressed.
> > > Not to mention the fact that you don't care what's in pg_attribute,
> > > the fact that schema may change, etc. There is a table and a
> > > dictionary for this table that you refresh from time to time. Very
> > > simple.
> >
> > You cannot "just" refresh a dictionary used once to compress an
> > object, because you need it to decompress the object too.
>
> Right. That's what I was trying to refer to when mentioning that we might
> need to add a bit of additional information to the varlena header for
> datums compressed with a dictionary.
>
> Greetings,
>
> Andres Freund
>
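The key-dictionary advantage described above can be sketched as follows (an illustrative stdlib-only comparison, not PostgreSQL code; the row shape, key dictionary, and names are assumptions): with a shared key dictionary the full key strings never appear in any stored datum, whereas a compressed page must still carry each key at least once.

```python
import json
import zlib

# 100 rows with the same repeated JSON keys.
rows = [{"customer_id": i, "order_date": "2023-02-07", "status": "shipped"}
        for i in range(100)]

# Page-level: compress a whole "page" of rows together. Duplicate keys are
# deduplicated within the page, but the page still stores them at least once.
page = json.dumps(rows).encode()
page_compressed = zlib.compress(page)

# Dictionary-level: a shared key dictionary replaces key strings with ids,
# so the full keys never appear in any stored datum at all. Decompression
# requires the same shared dictionary, of course.
key_dict = {"customer_id": 0, "order_date": 1, "status": 2}
datums = [json.dumps({key_dict[k]: v for k, v in r.items()}).encode()
          for r in rows]

assert b"customer_id" in page                        # page holds the keys
assert all(b"customer_id" not in d for d in datums)  # datums never do
```

This also illustrates the deferral point: each datum stays individually decodable, so a query that filters on proper columns first never has to touch (or decompress) the json at all.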

--
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/
