Re: Add ZSON extension to /contrib/

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Add ZSON extension to /contrib/
Date: 2021-05-26 16:43:37
Message-ID: CAEze2WhB6=NCJyWvLm-i3i=24R+PMphp1DXxHRoWMbM3Me1c3g@mail.gmail.com
Lists: pgsql-hackers

On Wed, 26 May 2021 at 12:49, Aleksander Alekseev
<aleksander(at)timescale(dot)com> wrote:
>
> Hi hackers,
>
> Many thanks for your feedback, I very much appreciate it!
>
> > If the extension is mature enough, why make it an extension in
> > contrib, and not instead either enhance the existing jsonb type with
> > it or make it a built-in type?
>
> > IMO we have too d*mn many JSON types already. If we can find a way
> > to shoehorn this optimization into JSONB, that'd be great. Otherwise
> > I do not think it's worth the added user confusion.
>
> Magnus, Tom,
>
> My reasoning is that if the problem can be solved with an extension
> there is little reason to modify the core. This seems to be in the
> spirit of PostgreSQL. If the community reaches the consensus to modify
> the core to introduce a similar feature, we could discuss this as
> well. It sounds like a lot of unnecessary work to me though (see
> below).
>
> > * doesn't cover all cases, notably indexes.
>
> Tom,
>
> Not sure if I follow. What cases do you have in mind?
>
> > Do note that e.g. postgis is not in contrib, but is available in e.g. RDS.
>
> Matthias,
>
> Good point. I suspect that PostGIS is an exception though...

Quite a few other non-/common/ extensions are available in RDS [0],
among them HLL (from citusdata), pglogical (from 2ndQuadrant) and
orafce (from Pavel Stehule et al.).

> > I like the idea of the ZSON type, but I'm somewhat disappointed by its
> > current limitations
>
> Several people suggested various enhancements right after learning
> about ZSON. Time showed, however, that none of the real-world users
> really need e.g. more than one common dictionary per database. I
> suspect this is because no one has more than 2**16 repeatable unique
> strings (one dictionary limitation) in their documents. Thus there is
> no benefit in having separate dictionaries and corresponding extra
> complexity.

IMO the main benefit of having different dictionaries is that you
could have a small dictionary for small, very structured JSONB fields
(e.g. some time-series data) and a large one for large / unstructured
JSONB fields, without the small, structured fields paying the
performance cost of that large and varied dictionary. Although a
binary search is O(log n) and thus still quite cheap even for large
dictionaries, the extra size is certainly not free, and you'll be
touching more memory in the process.

> > - Each dictionary uses a lot of memory, regardless of the number of
> > actual stored keys. For 32-bit systems the base usage of a dictionary
> > without entries ((sizeof(Word) + sizeof(uint16)) * 2**16) would be
> > almost 1MB, and for 64-bit it would be 1.7MB. That is significantly
> > more than I'd want to install.
>
> You are probably right on this one, this part could be optimized. I
> will address this if we agree on submitting the patch.
>
> > - You call gettimeofday() in both dict_get and in get_current_dict_id.
> > These functions can be called in short and tight loops (for small GSON
> > fields), in which case it would add significant overhead through the
> > implied syscalls.
>
> I must admit, I'm not an expert in this area. My understanding is that
> gettimeofday() is implemented as single virtual memory access on
> modern operating systems, e.g. VDSO on Linux, thus it's very cheap.
> I'm not that sure about other supported platforms though. Probably
> worth investigating.

Yes, but the vDSO does not necessarily work on all systems: e.g. in
2017, a lot of EC2 [1] still ran on Xen, where gettimeofday() was not
serviced by the vDSO and fell back to a real syscall. I'm uncertain
whether this issue persists on their newer KVM/Nitro hypervisor.
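
Even where the vDSO path is fast, a cheap mitigation would be not to
consult the clock once per datum at all. A rough sketch of what I mean
(hypothetical names, not ZSON's actual code; it throttles the clock
check with a call counter, assuming a slightly stale dictionary id is
acceptable):

    #include <sys/time.h>
    #include <stdint.h>

    #define DICT_RECHECK_EVERY  1024      /* datums between clock checks */
    #define DICT_TTL_USEC       1000000   /* re-resolve at most once per second */

    extern int32_t resolve_current_dict_id(void);  /* assumed expensive path */

    static int32_t  cached_dict_id = -1;
    static uint64_t last_check_usec;
    static uint32_t calls_since_check;

    static int32_t
    get_current_dict_id_cached(void)
    {
        struct timeval tv;
        uint64_t now_usec;

        /* Fast path: skip gettimeofday() entirely for most calls. */
        if (cached_dict_id >= 0 && ++calls_since_check < DICT_RECHECK_EVERY)
            return cached_dict_id;

        calls_since_check = 0;
        gettimeofday(&tv, NULL);
        now_usec = (uint64_t) tv.tv_sec * 1000000 + (uint64_t) tv.tv_usec;

        if (cached_dict_id < 0 || now_usec - last_check_usec >= DICT_TTL_USEC)
        {
            cached_dict_id = resolve_current_dict_id();
            last_check_usec = now_usec;
        }
        return cached_dict_id;
    }

With dictionary switches being rare, this bounds the clock reads to
one per DICT_RECHECK_EVERY datums, whether or not gettimeofday() ends
up being a real syscall.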

> > It does mean that you're deTOASTing
> > the full GSON field, and that the stored bytestring will not be
> > structured / doesn't work well with current debuggers.
>
> Unfortunately, I'm not very well aware of debugging tools in this
> context. Could you please name the debuggers I should take into
> account?

Hmm, I was mistaken in that regard. I was under the impression that at
least one of pageinspect, pg_filedump and pg_hexedit supported column
value introspection, which it turns out they largely do not:
pg_filedump (and thus pg_hexedit) has some introspection, but nothing
specialized for jsonb (yet).

The point I tried to make was that introspection of ZSON would be even
more difficult, because it adds a non-standard compression layer that
makes introspection effectively impossible: the algorithm can replace
byte sequences other than the strings it is meant to replace, so it is
difficult to recover the structure from the encoded string.

With regards,

Matthias van de Meent

[0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html#PostgreSQL.Concepts.General.FeatureSupport.Extensions.13x
[1] https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
