Re: jsonb format is pessimal for toast compression

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Larry White <ljw1001(at)gmail(dot)com>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-08-09 00:15:05
Message-ID: 20140809001505.GN16422@tamriel.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Stephen Frost <sfrost(at)snowman(dot)net> writes:
> > * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> >> I looked into the issue reported in bug #11109. The problem appears to be
> >> that jsonb's on-disk format is designed in such a way that the leading
> >> portion of any JSON array or object will be fairly incompressible, because
> >> it consists mostly of a strictly-increasing series of integer offsets.
> >> This interacts poorly with the code in pglz_compress() that gives up if
> >> it's found nothing compressible in the first first_success_by bytes of a
> >> value-to-be-compressed. (first_success_by is 1024 in the default set of
> >> compression parameters.)
>
> > I haven't looked at this in any detail, so take this with a grain of
> > salt, but what about teaching pglz_compress about using an offset
> > farther into the data, if the incoming data is quite a bit larger than
> > 1k? This is just a test to see if it's worthwhile to keep going, no?
>
> Well, the point of the existing approach is that it's a *nearly free*
> test to see if it's worthwhile to keep going; there's just one if-test
> added in the outer loop of the compression code. (cf commit ad434473ebd2,
> which added that along with some other changes.) AFAICS, what we'd have
> to do to do it as you suggest would be to execute compression on some subset
> of the data and then throw away that work entirely. I do not find that
> attractive, especially when for most datatypes there's no particular
> reason to look at one subset instead of another.

Ah, I see- we use the first block because we can then reuse the work
done on it if we decide to continue with the compression.  Makes sense.
We could possibly arrange to have the amount attempted depend on the
data type, but as you point out, we can't do that without teaching the
lower-level components about types, which is less than ideal.
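
To spell out what we're discussing, the give-up test boils down to
roughly this (a simplified sketch of the heuristic, not the actual
pg_lzcompress.c code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified sketch of the give-up heuristic added in commit
 * ad434473ebd2 -- not the real pg_lzcompress.c implementation.
 * first_success_by comes from the PGLZ_Strategy in use (1024 in the
 * default strategy).
 */
static bool
keep_compressing(size_t input_scanned, bool found_match,
                 size_t first_success_by)
{
    /*
     * One cheap comparison per outer-loop iteration: once we've scanned
     * first_success_by bytes of input without finding a single
     * back-reference, declare the datum incompressible and bail out.
     * If we do keep going, the output already emitted for those bytes
     * is kept as-is, so no work is thrown away.
     */
    return found_match || input_scanned < first_success_by;
}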

What about considering how large the object is when we analyze whether
it compresses well overall?  That is- for a larger object, make a
larger effort to compress it.  There's clearly a pessimistic case which
could arise from that, but it may be better than the current situation.
The clear risk is that such an algorithm may well end up being very
type-specific, meaning we'd make things worse for some types (eg: for
byteas which never compress well, we'd likely spend more CPU time
trying than we do today).
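
Purely as a sketch of the shape I have in mind (the ratio and the cap
below are made-up numbers for illustration, not a proposal):

#include <stddef.h>

/*
 * Hypothetical only -- scale the early-give-up window with the size of
 * the input datum instead of using a flat 1024 bytes.  The 1/8 ratio
 * and the 8kB cap are invented; the pessimistic case is exactly the
 * one above, e.g. an incompressible bytea now costs up to 8kB of
 * wasted compression effort rather than 1kB.
 */
static size_t
scaled_first_success_by(size_t input_size)
{
    size_t      window = input_size / 8;

    if (window < 1024)          /* never less than today's default */
        window = 1024;
    if (window > 8192)          /* arbitrary cap for illustration */
        window = 8192;
    return window;
}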

> 1. The real problem here is that jsonb is emitting quite a bit of
> fundamentally-nonrepetitive data, even when the user-visible input is very
> repetitive. That's a compression-unfriendly transformation by anyone's
> measure. Assuming that some future replacement for pg_lzcompress() will
> nonetheless be able to compress the data strikes me as mostly wishful
> thinking. Besides, we'd more than likely have a similar early-exit rule
> in any substitute implementation, so that we'd still be at risk even if
> it usually worked.

I agree that jsonb ends up being nonrepetitive in part, which is why
I've been trying to push the discussion in the direction of making it
more likely for the highly-compressible data to be considered rather
than just the start of the jsonb object.  In general, though, I don't
care for having to cater to our compression algorithm in this regard,
as the exact same problem could, and likely does, exist in some
real-life bytea-using PG applications.
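
To make the nonrepetitive-prefix point concrete (a rough illustration
only, not the actual JEntry layout):

#include <stdio.h>

/*
 * Rough illustration, not the real jsonb on-disk format: even when
 * every element of a JSON array is the same short string, the
 * per-element offsets stored up front are all distinct because they
 * strictly increase, so the leading bytes of the datum give an
 * LZ-style compressor nothing to match against.
 */
int
main(void)
{
    int         elem_len = 8;   /* pretend each element occupies 8 bytes */
    int         i;

    for (i = 0; i < 16; i++)
        printf("offset[%d] = %d\n", i, (i + 1) * elem_len);

    /*
     * Prints 8, 16, 24, ... -- no repeated patterns, unlike the element
     * data itself, which is entirely repetitive.
     */
    return 0;
}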

I disagree that another algorithm wouldn't be able to manage better on
this data than pglz does.  pglz, in my experience, is notoriously bad
at certain data sets which don't hurt other algorithms nearly as badly.

> 2. Are we going to ship 9.4 without fixing this? I definitely don't see
> replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
> jsonb is still within the bounds of reason.

I'd really hate to ship 9.4 without a fix for this, but I'd have a
similarly hard time with shipping 9.4 without the binary search
component.

> Considering all the hype that's built up around jsonb, shipping a design
> with a fundamental performance handicap doesn't seem like a good plan
> to me. We could perhaps band-aid around it by using different compression
> parameters for jsonb, although that would require some painful API changes
> since toast_compress_datum() doesn't know what datatype it's operating on.

I don't like the idea of shipping with this handicap either.

Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"?  We'd be able to make that the
default for jsonb columns too, no?  Again- I'll admit this is shooting
from the hip, but I wanted to get these thoughts out and I won't have
much more time tonight.
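
Something along these lines is what I'm picturing, if we could get a
non-default strategy passed down from the toast code (a sketch only-
the fields other than first_success_by just echo what I believe are the
current defaults, and none of this is meant as actual patch material):

#include <limits.h>
#include "utils/pg_lzcompress.h"    /* PGLZ_Strategy (pre-9.5 location) */

/*
 * Sketch of a "just compress it, no matter what" strategy: the same as
 * the stock defaults except that first_success_by is effectively
 * disabled, so the compressor never gives up early.  A hypothetical new
 * storage type (or a jsonb-aware toast_compress_datum) could select
 * this instead of PGLZ_strategy_default.
 */
static const PGLZ_Strategy never_give_up_strategy_data = {
    32,                 /* min_input_size */
    INT_MAX,            /* max_input_size */
    25,                 /* min_comp_rate: still require 25% savings */
    INT_MAX,            /* first_success_by: never bail out early */
    128,                /* match_size_good */
    10                  /* match_size_drop */
};
static const PGLZ_Strategy *const never_give_up_strategy =
    &never_give_up_strategy_data;

(If memory serves, pg_lzcompress.c already exposes a
PGLZ_strategy_always along roughly these lines, so this might mostly be
a matter of wiring up a way to reach it.)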

Thanks!

Stephen
