Re: [HACKERS] Custom compression methods

From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-12-12 11:41:42
Message-ID: 20171212144142.4bfc7aac@postgrespro.ru
Lists: pgsql-hackers

On Mon, 11 Dec 2017 20:53:29 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> But let me play the devil's advocate for a while and question the
> usefulness of this approach to compression. Some of the questions were
> mentioned in the thread before, but I don't think they got the
> attention they deserve.

Hi. I will try to explain why this approach could be better than others.

>
>
> Replacing the algorithm used to compress all varlena values (in a way
> that makes it transparent for the data type code).
> ----------------------------------------------------------------------
>
> While pglz served us well over time, it was repeatedly mentioned that
> in some cases it becomes the bottleneck. So supporting other state of
> the art compression algorithms seems like a good idea, and this patch
> is one way to do that.
>
> But perhaps we should simply make it an initdb option (in which case
> the whole cluster would simply use e.g. lz4 instead of pglz)?
>
> That seems like a much simpler approach - it would only require some
> ./configure options to add --with-lz4 (and other compression
> libraries), an initdb option to pick compression algorithm, and
> probably noting the choice in cluster controldata.

Replacing pglz for all varlena values wasn't the goal of the patch, but
the patch makes it possible, and I think that's good. As Robert
mentioned, pglz could become a built-in, undroppable compression method,
and other methods could be added alongside it. In the future this could
also open the way to specifying the compression method for a particular
database or for the whole cluster.
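
For illustration (the exact syntax is of course still up for discussion,
and the method, handler and table names here are made up), per-column
usage with the patch would look roughly like:

    CREATE COMPRESSION METHOD lz4 HANDLER lz4_compression_handler;
    ALTER TABLE t ALTER COLUMN payload SET COMPRESSION lz4;

A built-in pglz would just be a predefined, undroppable method of the
same kind, and a database- or cluster-wide default would only need to
point at one of these methods.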

>
> Custom datatype-aware compression (e.g. the tsvector)
> ----------------------------------------------------------------------
>
> Exploiting knowledge of the internal data type structure is a
> promising way to improve compression ratio and/or performance.
>
> The obvious question of course is why shouldn't this be done by the
> data type code directly, which would also allow additional benefits
> like operating directly on the compressed values.
>
> Another thing is that if the datatype representation changes in some
> way, the compression method has to change too. So it's tightly coupled
> to the datatype anyway.
>
> This does not really require any new infrastructure, all the pieces
> are already there.
>
> In some cases that may not be quite possible - the datatype may not be
> flexible enough to support alternative (compressed) representation,
> e.g. because there are no bits available for "compressed" flag, etc.
>
> Conclusion: IMHO if we want to exploit the knowledge of the data type
> internal structure, perhaps doing that in the datatype code directly
> would be a better choice.

It could be, but imagine we add internal compression to tsvector. Now
tsvector has two on-disk formats and has lost one bit somewhere in its
header. If we later find a better compression algorithm we can't easily
add it, because one is already baked in and having three different
formats for a single type is not good. Or suppose the internal
compression is already in place and we then decide to use dictionaries
for tsvector (for users who store a limited set of words): the values
would then have to go through two compression stages, one for the
internal format and one for the dictionaries.
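
To make the contrast concrete, here is a rough sketch (syntax and names
are illustrative, not final) of how this looks with separate compression
methods: the tsvector on-disk format stays unchanged, and a better
method can later replace an older one per column:

    CREATE COMPRESSION METHOD ts_cm1 HANDLER tsvector_cm1_handler;
    ALTER TABLE docs ALTER COLUMN tsv SET COMPRESSION ts_cm1;
    -- later, without a new tsvector format or stolen header bits:
    CREATE COMPRESSION METHOD ts_cm2 HANDLER tsvector_cm2_handler;
    ALTER TABLE docs ALTER COLUMN tsv SET COMPRESSION ts_cm2;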

>
>
> Custom datatype-aware compression with additional column-specific
> metadata (e.g. the jsonb with external dictionary).
> ----------------------------------------------------------------------
>
> Exploiting redundancy in multiple values in the same column (instead
> of compressing them independently) is another attractive way to help
> the compression. It is inherently datatype-aware, but currently can't
> be implemented directly in datatype code as there's no concept of
> column-specific storage (e.g. to store dictionary shared by all values
> in a particular column).
>
> I believe any patch addressing this use case would have to introduce
> such column-specific storage, and any solution doing that would
> probably need to introduce the same catalogs, etc.
>
> The obvious disadvantage of course is that we need to decompress the
> varlena value before doing pretty much anything with it, because the
> datatype is not aware of the compression.
>
> So I wonder if the patch should instead provide infrastructure for
> doing that in the datatype code directly.
>
> The other question is if the patch should introduce some
> infrastructure for handling the column context (e.g. column
> dictionary). Right now, whoever implements the compression has to
> implement this bit too.

Column-specific storage sounds optional to me. For example, compressing
timestamp[] with some kind of delta compression would not require it.
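
As a sketch (again, hypothetical method and handler names), such a
method would be attached with nothing but the column reference, since
all the state it needs lives inside each value:

    CREATE COMPRESSION METHOD ts_delta HANDLER timestamp_delta_handler;
    ALTER TABLE measurements ALTER COLUMN samples SET COMPRESSION ts_delta;

whereas something like a jsonb dictionary is the case that would
actually need the extra column-level storage.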

--
Regards,
Ildus Kurbangaliev
