Re: [HACKERS] Custom compression methods

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-12-01 15:18:49
Message-ID: a6fe2ee1-7f0e-67d1-7c5d-5075c17191d6@2ndquadrant.com
Lists: pgsql-hackers

On 12/01/2017 03:23 PM, Robert Haas wrote:
> On Thu, Nov 30, 2017 at 2:47 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> OK. I think it's a nice use case (and nice gains on the compression
>> ratio), demonstrating the datatype-aware compression. The question is
>> why shouldn't this be built into the datatypes directly?
>
> Tomas, thanks for running benchmarks of this. I was surprised to see
> how little improvement there was from other modern compression
> methods, although lz4 did appear to be a modest win on both size and
> speed. But I share your intuition that a lot of the interesting work
> is in datatype-specific compression algorithms. I have noticed in a
> number of papers that I've read that teaching other parts of the
> system to operate directly on the compressed data, especially for
> column stores, is a critical performance optimization; of course, that
> only makes sense if the compression is datatype-specific. I don't
> know exactly what that means for the design of this patch, though.
>

It has very little impact on this patch, as it has nothing to do with
columnar storage. That is, each value is compressed independently.

Column stores exploit the fact that they get a vector of values,
compressed in some data-aware way, e.g. some form of RLE or dictionary
compression, which allows them to evaluate expressions directly on the
compressed vector. But that's irrelevant here; we only get row-by-row
execution.
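
Just to illustrate what "evaluating expressions on the compressed vector"
means (a toy sketch, nothing to do with this patch; the struct and function
are invented for the example):

#include <stdint.h>
#include <stddef.h>

/* One run of a run-length-encoded column: "value" repeated "count" times. */
typedef struct RleRun
{
    int32_t     value;
    uint32_t    count;
} RleRun;

/*
 * Count rows satisfying "col = key" without materializing the decompressed
 * vector -- the predicate is evaluated once per run, not once per row.
 */
static uint64_t
rle_count_equal(const RleRun *runs, size_t nruns, int32_t key)
{
    uint64_t    matches = 0;

    for (size_t i = 0; i < nruns; i++)
    {
        if (runs[i].value == key)
            matches += runs[i].count;
    }

    return matches;
}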

Note: The idea of building a dictionary for the whole jsonb column (which
this patch should allow) does not make it "columnar compression" in the
"column store" sense. The executor will still get the decompressed value.

> As a general point, no matter which way you go, you have to somehow
> deal with on-disk compatibility. If you want to build in compression
> to the datatype itself, you need to find at least one bit someplace to
> mark the fact that you applied built-in compression. If you want to
> build it in as a separate facility, you need to denote the compression
> used someplace else. I haven't looked at how this patch does it, but
> the proposal in the past has been to add a value to vartag_external.

AFAICS the patch does that by setting a bit in the varlena header, and
then adding the OID of the compression method after the varlena header. So
you get (varlena header + OID + data).
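
So conceptually something like this (purely illustrative -- the struct and
field names are mine, not copied from the patch):

#include "postgres.h"

/*
 * Illustrative layout of a custom-compressed datum as described above.
 * The actual representation in the patch may well differ.
 */
typedef struct CustomCompressedDatum
{
    int32       va_header;      /* varlena header, with a bit marking custom compression */
    Oid         cmoid;          /* OID of the compression method used */
    char        data[FLEXIBLE_ARRAY_MEMBER];    /* compressed payload */
} CustomCompressedDatum;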

This has good and bad consequences.

Good: It's transparent for the datatype, so it does not have to worry
about the custom compression at all (and it may change arbitrarily).

Bad: It's transparent for the datatype, so it can't operate directly on
the compressed representation.

I don't think this is an argument against the patch, though. If the
datatype can support intelligent compression (and execution without
decompression), it has to be done in the datatype anyway.

> One nice thing about the latter method is that it can be used for any
> data type generically, regardless of how much bit-space is available
> in the data type representation itself. It's realistically hard to
> think of a data-type that has no bit space available anywhere but is
> still subject to data-type specific compression; bytea definitionally
> has no bit space but also can't benefit from special-purpose
> compression, whereas even something like text could be handled by
> starting the varlena with a NUL byte to indicate compressed data
> following. However, you'd have to come up with a different trick for
> each data type. Piggybacking on the TOAST machinery avoids that. It
> also implies that we only try to compress values that are "big", which
> is probably desirable if we're talking about a kind of compression
> that makes comprehending the value slower. Not all types of
> compression do, cf. commit 145343534c153d1e6c3cff1fa1855787684d9a38,
> and for those that don't it probably makes more sense to just build it
> into the data type.
>
> All of that is a somewhat separate question from whether we should
> have CREATE / DROP COMPRESSION, though (or Alvaro's proposal of using
> the ACCESS METHOD stuff instead). Even if we agree that piggybacking
> on TOAST is a good way to implement pluggable compression methods, it
> doesn't follow that the compression method is something that should be
> attached to the datatype from the outside; it could be built into it
> in a deep way. For example, "packed" varlenas (1-byte header) are a
> form of compression, and the default functions for detoasting always
> produced unpacked values, but the operators for the text data type
> know how to operate on the packed representation. That's sort of a
> trivial example, but it might well be that there are other cases where
> we can do something similar. Maybe jsonb, for example, can compress
> data in such a way that some of the jsonb functions can operate
> directly on the compressed representation -- perhaps the number of
> keys is easily visible, for example, or maybe more. In this view of
> the world, each data type should get to define its own compression
> method (or methods) but they are hard-wired into the datatype and you
> can't add more later, or if you do, you lose the advantages of the
> hard-wired stuff.
>

I agree with these thoughts in general, but I'm not quite sure what your
conclusion is regarding the patch.

The patch allows us to define custom compression methods that are
entirely transparent to the datatype machinery, i.e. they allow compression
even for data types that did not consider compression at all. That seems
valuable to me.

Of course, if the same compression logic can be built into the datatype
itself, it may allow additional benefits (like execution on compressed
data directly).
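
FWIW the "packed varlena" case is a nice minimal illustration of that
principle: a function written against the VAR*_ANY* macros can work on the
1-byte-header (packed) representation directly, without forcing an unpacked
copy. E.g. (just an illustration of the principle, not code from the patch):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_text_octet_length);

/*
 * Returns the byte length of a text value.  PG_GETARG_TEXT_PP() hands us
 * the value in possibly-packed (1-byte header) form, and the VAR*_ANY*
 * macros understand both header formats, so no unpacking copy is made.
 */
Datum
my_text_octet_length(PG_FUNCTION_ARGS)
{
    text       *t = PG_GETARG_TEXT_PP(0);

    PG_RETURN_INT32(VARSIZE_ANY_EXHDR(t));
}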

I don't see these two approaches as conflicting.

>
> BTW, another related concept that comes up a lot in discussions of
> this area is that we could do a lot better compression of columns if
> we had some place to store a per-column dictionary. I don't really
> know how to make that work. We could have a catalog someplace that
> stores an opaque blob for each column configured to use a compression
> method, and let the compression method store whatever it likes in
> there. That's probably fine if you are compressing the whole table at
> once and the blob is static thereafter. But if you want to update
> that blob as you see new column values there seem to be almost
> insurmountable problems.
>

Well, that's kinda the idea behind the configure/drop methods in the
compression handler, and Ildus has already implemented such dictionary
compression for the jsonb data type, see:

https://github.com/postgrespro/jsonbd

Essentially that stores the dictionary in a table, managed by a bunch of
background workers.
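
To sketch what I mean by the handler: it's essentially a function returning
a table of callbacks, roughly like the hypothetical example below. The names
and signatures are invented for illustration only (they are not the actual
API from the patch); the point is just where configure/drop and the
dictionary fit in.

#include "postgres.h"
#include "fmgr.h"
#include "nodes/pg_list.h"

PG_MODULE_MAGIC;

/*
 * Hypothetical callback table for a compression method -- invented names,
 * not the struct from the patch.  The configure/drop hooks are where
 * per-column state such as a dictionary table can be created and dropped.
 */
typedef struct CompressionRoutine
{
    void            (*configure) (Oid cmoptoid, List *options);
    void            (*drop) (Oid cmoptoid);
    struct varlena *(*compress) (struct varlena *value);
    struct varlena *(*decompress) (struct varlena *value);
} CompressionRoutine;

/* Stub callbacks standing in for a jsonbd-like dictionary implementation. */
static void
dict_configure(Oid cmoptoid, List *options)
{
    /* e.g. create the dictionary table and launch the background workers */
}

static void
dict_drop(Oid cmoptoid)
{
    /* e.g. drop the dictionary table and stop the workers */
}

static struct varlena *
dict_compress(struct varlena *value)
{
    /* replace jsonb keys with dictionary codes; the stub is a no-op */
    return value;
}

static struct varlena *
dict_decompress(struct varlena *value)
{
    /* map dictionary codes back to the original keys; the stub is a no-op */
    return value;
}

PG_FUNCTION_INFO_V1(dict_compression_handler);

Datum
dict_compression_handler(PG_FUNCTION_ARGS)
{
    CompressionRoutine *routine = palloc0(sizeof(CompressionRoutine));

    routine->configure = dict_configure;
    routine->drop = dict_drop;
    routine->compress = dict_compress;
    routine->decompress = dict_decompress;

    PG_RETURN_POINTER(routine);
}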

>
> To be clear, I'm not trying to load this patch down with a requirement
> to solve every problem in the universe. On the other hand, I think it
> would be easy to beat a patch like this into shape in a fairly
> mechanical way and then commit-and-forget. That might be leaving a
> lot of money on the table; I'm glad you are thinking about the bigger
> picture and hope that my thoughts here somehow contribute.
>

Thanks ;-)

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
