Re: jsonb format is pessimal for toast compression

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Larry White <ljw1001(at)gmail(dot)com>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-08-08 15:18:31
Message-ID: 10350.1407511111@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 08/07/2014 11:17 PM, Tom Lane wrote:
>> I looked into the issue reported in bug #11109. The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.

> Ouch.

> Back when this structure was first presented at pgCon 2013, I wondered
> if we shouldn't extract the strings into a dictionary, because of key
> repetition, and convinced myself that this shouldn't be necessary
> because in significant cases TOAST would take care of it.

That's not really the issue here, I think. The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
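
To make that concrete, here is a toy C illustration (invented element sizes,
not the real JEntry encoding): with identical repeated elements, a series of
lengths is one value over and over, while the corresponding series of offsets
never repeats at all:

    /* Toy comparison of lengths vs. offsets for identical elements.
     * Simplified sketch only; not the actual jsonb JEntry layout. */
    #include <stdio.h>

    int main(void)
    {
        int nelems = 8;
        int elemlen = 5;            /* every element the same length */
        int offset = 0;

        printf("lengths:");
        for (int i = 0; i < nelems; i++)
            printf(" %d", elemlen); /* 5 5 5 5 ... compresses beautifully */

        printf("\noffsets:");
        for (int i = 0; i < nelems; i++)
        {
            offset += elemlen;
            printf(" %d", offset);  /* 5 10 15 20 ... no repetition */
        }
        printf("\n");
        return 0;
    }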

> Maybe we should have pglz_compress() look at the *last* 1024 bytes if it
> can't find anything worth compressing in the first, for values larger
> than a certain size.

Not possible with anything like the current implementation, since it's
just an on-the-fly status check, not a trial compression.
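
Roughly the shape of that check, as I understand it (the names and the match
test below are mine, meant only to show the idea, not the actual
pg_lzcompress.c logic): the decision is made inside the single forward
compression pass, so by the time the later, possibly repetitive part of the
datum comes along, compression may already have been abandoned:

    #include <stdbool.h>
    #include <string.h>

    /* Sketch of an on-the-fly early-exit check: if no back-reference has
     * been found by the time 'first_success_by' input bytes have gone by,
     * give up and store the datum uncompressed.  Illustrative only. */
    static bool
    worth_compressing(const char *src, int srclen, int first_success_by)
    {
        bool        found_match = false;

        for (int pos = 0; pos < srclen; pos++)
        {
            if (!found_match && pos >= first_success_by)
                return false;   /* early exit: input looks incompressible */

            /* naive stand-in for the history-table lookup: does the 4-byte
             * sequence starting at 'pos' occur anywhere earlier? */
            for (int back = 0; !found_match && back + 4 <= pos; back++)
            {
                if (pos + 4 <= srclen &&
                    memcmp(src + back, src + pos, 4) == 0)
                    found_match = true;
            }
        }
        return found_match;
    }

Looking at the *last* 1kB instead would amount to doing a separate trial
compression of that region, which is a different animal.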

> It's worth noting that this is a fairly pathological case. AIUI the
> example you constructed has an array with 100k string elements. I don't
> think that's typical. So I suspect that unless I've misunderstood the
> statement of the problem we're going to find that almost all the jsonb
> we will be storing is still compressible.

Actually, the 100K-string example I constructed *did* compress. Larry's
example that's not compressing is only about 12kB. AFAICS, the threshold
for trouble is in the vicinity of 256 array or object entries (resulting
in a 1kB JEntry array). That doesn't seem especially high. There is a
probabilistic component as to whether the early-exit case will actually
fire, since any chance hash collision will probably result in some 3-byte
offset prefix getting compressed. But the fact that a beta tester tripped
over this doesn't leave me with a warm feeling about the odds that it
won't happen much in the field.
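
For reference, the arithmetic behind that 256-entry threshold, assuming 4-byte
JEntries and taking pglz's give-up point to be the first 1024 bytes Andrew
mentions:

    256 entries x 4 bytes/JEntry = 1024 bytes of strictly increasing offsets
    => the entire early-exit window can be filled with nonrepetitive data
       before any of the (possibly repetitive) string content is reached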

regards, tom lane

