From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgreSQL(dot)org, Larry White <ljw1001(at)gmail(dot)com> |
Subject: | Re: jsonb format is pessimal for toast compression |
Date: | 2014-08-08 14:21:46 |
Message-ID: | 53E4DCFA.9020305@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 08/07/2014 11:17 PM, Tom Lane wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)
[snip]
> There is plenty of compressible data once we get into the repetitive
> strings in the payload part --- but that starts at offset 944, and up to
> that point there is nothing that pg_lzcompress can get a handle on. There
> are, by definition, no sequences of 4 or more repeated bytes in that area.
> I think in principle pg_lzcompress could decide to compress the 3-byte
> sequences consisting of the high-order 24 bits of each offset; but it
> doesn't choose to do so, probably because of the way its lookup hash table
> works:
>
> * pglz_hist_idx -
> *
> * Computes the history table slot for the lookup by the next 4
> * characters in the input.
> *
> * NB: because we use the next 4 characters, we are not guaranteed to
> * find 3-character matches; they very possibly will be in the wrong
> * hash list. This seems an acceptable tradeoff for spreading out the
> * hash keys more.
>
> For jsonb header data, the "next 4 characters" are *always* different, so
> only a chance hash collision can result in a match. There is therefore a
> pretty good chance that no compression will occur before it gives up
> because of first_success_by.
>
> I'm not sure if there is any easy fix for this. We could possibly change
> the default first_success_by value, but I think that'd just be postponing
> the problem to larger jsonb objects/arrays, and it would hurt performance
> for genuinely incompressible data. A somewhat painful, but not yet
> out-of-the-question, alternative is to change the jsonb on-disk
> representation. Perhaps the JEntry array could be defined as containing
> element lengths instead of element ending offsets. Not sure though if
> that would break binary searching for JSON object keys.
>
>
Ouch.
Back when this structure was first presented at pgCon 2013, I wondered
if we shouldn't extract the strings into a dictionary, because of key
repetition, and convinced myself that this shouldn't be necessary
because in significant cases TOAST would take care of it.
Maybe we should have pglz_compress() look at the *last* 1024 bytes if it
can't find anything worth compressing in the first, for values larger
than a certain size.
It's worth noting that this is a fairly pathological case. AIUI the
example you constructed has an array with 100k string elements. I don't
think that's typical. So I suspect that unless I've misunderstood the
statement of the problem we're going to find that almost all the jsonb
we will be storing is still compressible.
cheers
andrew
From | Date | Subject | |
---|---|---|---|
Next Message | MauMau | 2014-08-08 14:30:25 | Re: [RFC] Should smgrtruncate() avoid sending sinval message for temp relations |
Previous Message | Ramirez, Danilo | 2014-08-08 14:00:09 | Re: PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong? |