Re: jsonb format is pessimal for toast compression

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Larry White <ljw1001(at)gmail(dot)com>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-08-08 14:21:46
Message-ID: 53E4DCFA.9020305@dunslane.net
Lists: pgsql-hackers


On 08/07/2014 11:17 PM, Tom Lane wrote:
> I looked into the issue reported in bug #11109. The problem appears to be
> that jsonb's on-disk format is designed in such a way that the leading
> portion of any JSON array or object will be fairly incompressible, because
> it consists mostly of a strictly-increasing series of integer offsets.
> This interacts poorly with the code in pglz_compress() that gives up if
> it's found nothing compressible in the first first_success_by bytes of a
> value-to-be-compressed. (first_success_by is 1024 in the default set of
> compression parameters.)

[snip]

> There is plenty of compressible data once we get into the repetitive
> strings in the payload part --- but that starts at offset 944, and up to
> that point there is nothing that pg_lzcompress can get a handle on. There
> are, by definition, no sequences of 4 or more repeated bytes in that area.
> I think in principle pg_lzcompress could decide to compress the 3-byte
> sequences consisting of the high-order 24 bits of each offset; but it
> doesn't choose to do so, probably because of the way its lookup hash table
> works:
>
> * pglz_hist_idx -
> *
> * Computes the history table slot for the lookup by the next 4
> * characters in the input.
> *
> * NB: because we use the next 4 characters, we are not guaranteed to
> * find 3-character matches; they very possibly will be in the wrong
> * hash list. This seems an acceptable tradeoff for spreading out the
> * hash keys more.
>
> For jsonb header data, the "next 4 characters" are *always* different, so
> only a chance hash collision can result in a match. There is therefore a
> pretty good chance that no compression will occur before it gives up
> because of first_success_by.
>
> I'm not sure if there is any easy fix for this. We could possibly change
> the default first_success_by value, but I think that'd just be postponing
> the problem to larger jsonb objects/arrays, and it would hurt performance
> for genuinely incompressible data. A somewhat painful, but not yet
> out-of-the-question, alternative is to change the jsonb on-disk
> representation. Perhaps the JEntry array could be defined as containing
> element lengths instead of element ending offsets. Not sure though if
> that would break binary searching for JSON object keys.
>
>

Ouch.
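
Just to convince myself I understand the problem, here's a toy
illustration (hypothetical numbers only, not the real JEntry encoding,
which also packs type bits into each entry) of why a run of strictly
increasing end offsets defeats pglz while the equivalent run of lengths
would not:

/*
 * Toy comparison: 16 array elements, each a 6-byte string.
 * The "offsets" stream never repeats any 4-byte sequence (only the
 * 3-byte 00 00 00 runs, which pglz's 4-character hash tends to miss),
 * while the "lengths" stream is the same 4 bytes over and over.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t offsets[16];
    uint32_t lengths[16];

    for (int i = 0; i < 16; i++)
    {
        offsets[i] = (uint32_t) (i + 1) * 6;    /* 6, 12, 18, ... strictly increasing */
        lengths[i] = 6;                         /* highly repetitive */
    }

    const unsigned char *o = (const unsigned char *) offsets;
    const unsigned char *l = (const unsigned char *) lengths;

    printf("offsets:");
    for (size_t i = 0; i < sizeof(offsets); i++)
        printf(" %02x", o[i]);
    printf("\nlengths:");
    for (size_t i = 0; i < sizeof(lengths); i++)
        printf(" %02x", l[i]);
    printf("\n");
    return 0;
}

The lengths stream is the sort of byte pattern the representation change
you floated (storing element lengths instead of ending offsets) would put
in front of pglz.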

Back when this structure was first presented at PGCon 2013, I wondered
whether we shouldn't extract the strings into a dictionary, because of key
repetition, but convinced myself that it shouldn't be necessary because in
the significant cases TOAST would take care of it.

Maybe, for values larger than a certain size, we should have
pglz_compress() look at the *last* 1024 bytes if it can't find anything
worth compressing in the first 1024.
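
Very roughly, something like this, purely as a sketch: window_has_repeats()
is a made-up, brute-force stand-in for pglz's real history-table match
search, worth_compressing() doesn't exist anywhere, and first_success_by
here just stands for the strategy field you described.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Crude stand-in for "did we find anything compressible here?":
 * look for any 4-byte sequence that occurs twice within the window.
 * The real code would keep using the pglz history table; this is
 * only to make the sketch self-contained.
 */
static bool
window_has_repeats(const char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++)
        for (size_t j = i + 1; j + 4 <= len; j++)
            if (memcmp(buf + i, buf + j, 4) == 0)
                return true;
    return false;
}

/*
 * Sketch of the idea: if the first first_success_by bytes look
 * incompressible but the value is large, probe the tail window as
 * well before giving up, since jsonb's compressible payload tends
 * to sit after the header.
 */
bool
worth_compressing(const char *src, size_t srclen, size_t first_success_by)
{
    size_t head = srclen < first_success_by ? srclen : first_success_by;

    if (window_has_repeats(src, head))
        return true;

    if (srclen > 2 * first_success_by)
        return window_has_repeats(src + srclen - first_success_by,
                                  first_success_by);

    return false;
}

This wouldn't make the header itself compress any better, of course; it
would only keep first_success_by from rejecting the whole datum.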

It's worth noting that this is a fairly pathological case. AIUI the
example you constructed has an array with 100k string elements, and I
don't think that's typical. So I suspect that, unless I've misunderstood
the statement of the problem, we're going to find that almost all the
jsonb we will be storing is still compressible.

cheers

andrew
