Re: Add ZSON extension to /contrib/

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: Add ZSON extension to /contrib/
Date: 2021-05-28 14:22:26
Message-ID: 09ec9e70-d901-40a6-cbca-b0e375eebb73@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 5/28/21 6:35 AM, Tomas Vondra wrote:
>
>>
>> IMO the main benefit of having different dictionaries is that you
>> could have a small dictionary for small and very structured JSONB
>> fields (e.g. some time-series data), and a large one for large /
>> unstructured JSONB fields, without having the significant performance
>> impact of having that large and varied dictionary on the
>> small&structured field. Although a binary search is log(n) and thus
>> still quite cheap even for large dictionaries, the extra size is
>> certainly not free, and you'll be touching more memory in the process.
>>
> I'm sure we can think of various other arguments for allowing separate
> dictionaries. For example, what if you drop a column? With one huge
> dictionary you're bound to keep the data forever. With per-column dicts
> you can just drop the dict and free disk space / memory.
>
> I also find it hard to believe that no one needs 2**16 strings. I mean,
> 65k is not that much, really. To give an example, I've been toying with
> storing bitcoin blockchain in a database - one way to do that is storing
> each block as a single JSONB document. But each "item" (eg. transaction)
> is identified by a unique hash, so that means (tens of) thousands of
> unique strings *per document*.
>
> Yes, it's a bit silly and extreme, and maybe the compression would not
> help much in this case. But it shows that 2**16 is damn easy to hit.
>
> In other words, this seems like a nice example of survivor bias, where
> we only look at cases for which the existing limitations are acceptable,
> ignoring the (many) remaining cases eliminated by those limitations.
>
>

I don't think we should lightly discard the use of 2 byte keys though.
Maybe we could use a scheme similar to what we use for text lengths,
where the first bit indicates whether we have a 1 byte or 4 byte length
indicator. Many dictionaries will have less that 2^15-1 entries, so they
would use exclusively the smaller keys.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2021-05-28 15:30:49 Re: Command statistics system (cmdstats)
Previous Message Johannes Graën 2021-05-28 14:12:33 Degression (PG10 > 11, 12 or 13)