Re: [HACKERS] Custom compression methods

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Евгений Шишкин <itparanoia(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Oleg Bartunov <obartunov(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Chapman Flack <chap(at)anastigmatix(dot)net>
Subject: Re: [HACKERS] Custom compression methods
Date: 2018-04-23 09:40:33
Message-ID: add85c85-0e85-aff5-d7e6-14c99c715251@postgrespro.ru
Lists: pgsql-hackers

On 22.04.2018 16:21, Alexander Korotkov wrote:
> On Fri, Apr 20, 2018 at 7:45 PM, Konstantin Knizhnik
> <k(dot)knizhnik(at)postgrespro(dot)ru <mailto:k(dot)knizhnik(at)postgrespro(dot)ru>> wrote:
>
> On 30.03.2018 19:50, Ildus Kurbangaliev wrote:
>
> On Mon, 26 Mar 2018 20:38:25 +0300
> Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru
> <mailto:i(dot)kurbangaliev(at)postgrespro(dot)ru>> wrote:
>
> Attached rebased version of the patch. Fixed conflicts in
> pg_class.h.
>
> New rebased version due to conflicts in master. Also fixed a few
> errors, and removed the cmdrop method since it couldn't be tested.
>
>  It seems to be useful (and not so difficult) to use custom
> compression methods for WAL compression as well: replace the direct
> calls of pglz_compress in xloginsert.c
>
>
> I'm going to object at this point, and I have the following arguments for that:
>
> 1) WAL compression is much more critical for durability than datatype
> compression.  Imagine the compression algorithm contains a bug which
> causes the decompress method to segfault.  In the case of datatype
> compression, that would cause a crash only on access to the affected
> value, while the rest of the database keeps working, giving you
> a chance to localize and investigate the issue.  In the case of
> WAL compression, recovery itself would crash the server.  That is
> a much more serious disaster: you wouldn't be able to bring
> your database up, and the same would happen on the standby.
>
Well, I do not think that anybody will try to implement their own
compression algorithm...
From my point of view, the main value of this patch is that it allows
replacing the pglz algorithm with a more efficient one, for example zstd.
On some data sets zstd provides a more than 10 times better compression
ratio and is at the same time faster than pglz.
I do not think that the risk of data corruption caused by WAL compression
with some alternative compression algorithm (zlib, zstd, ...) is higher
than with the builtin Postgres compression.
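
Whatever algorithm is plugged in, a replacement for pglz in xloginsert.c
would have to keep the same contract: a lossless round-trip, and giving up
when the data does not shrink. A minimal Python sketch of that contract,
using stdlib zlib as a stand-in for zstd (function names are illustrative,
not from the patch):

```python
import zlib

def wal_compress(data, level=6):
    """Compress a WAL block; return None when compression does not help,
    mirroring how pglz_compress gives up on incompressible input."""
    packed = zlib.compress(data, level)
    return packed if len(packed) < len(data) else None

def wal_decompress(packed):
    # Recovery depends on the round-trip being exact and lossless.
    return zlib.decompress(packed)
```

In C this would presumably be ZSTD_compress()/ZSTD_decompress() behind the
same interface that pglz_compress exposes today.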

> 2) The idea of custom compression methods is that some columns may
> have a specific data distribution, which could be handled better with
> a particular compression method and particular parameters.  In
> WAL compression you're dealing with the whole WAL stream, containing
> all the values from the database cluster.  Moreover, if custom compression
> methods are defined for columns, then in the WAL stream the values are
> already compressed in the most efficient way.  However, it might
> turn out that some compression method is better for WAL in the general
> case (there are benchmarks showing our pglz is not very good in
> comparison to the alternatives).  But in that case I would prefer to just
> switch our WAL to a different compression method one day.  Thankfully
> we don't preserve WAL compatibility between major releases.

Frankly speaking, I do not believe that anybody will use custom
compression this way: implementing their own compression methods for a
specific data type.
Maybe for json/jsonb, but even then only if the custom compression API
allows storing a compression dictionary separately (which, as far as I
understand, is not currently supported).
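
For illustration, a separately stored dictionary can be emulated with
stdlib zlib's zdict parameter (the dictionary contents here are
hypothetical; a real one would be trained on sampled column values):

```python
import zlib

# A shared dictionary of substrings common to the column's documents.
# Hypothetical contents, standing in for a trained zstd dictionary.
SHARED_DICT = b'{"user_id": , "status": "active", "country": "'

def compress_with_dict(doc):
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=SHARED_DICT)
    return c.compress(doc) + c.flush()

def decompress_with_dict(data):
    # The same dictionary must be available at decompression time,
    # which is exactly why it needs to be stored somewhere durable.
    d = zlib.decompressobj(zlib.MAX_WBITS, zdict=SHARED_DICT)
    return d.decompress(data) + d.flush()
```

Small documents that share key names with the dictionary compress far
better than they would standalone, which is the point of per-column
dictionaries for json/jsonb.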

When I worked for SciDB (a database for scientists which mostly has to
deal with multidimensional arrays of data), our first intention was to
implement custom compression methods for particular data types and
data distributions. For example, there are very fast, simple and
efficient algorithms for encoding sequences of monotonically increasing
integers, ...
But after several experiments we rejected this idea and switched to using
generic compression methods, mostly because we did not want the compressor
to know much about page layout, data type representation, ... In Postgres,
from my point of view, we have a similar situation. Assume that we have a
column of serial type. It is a good candidate for compression, isn't it?
But this approach deals only with individual attribute values. It cannot
take any advantage of the fact that this particular column is
monotonically increasing. That can be done only with page-level
compression, but that is a different story.
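
The monotonic-integer case mentioned above can be sketched in a few
lines; note that it only works when the encoder sees the whole sequence,
which per-value TOAST compression never does:

```python
from itertools import accumulate

def delta_encode(values):
    """Store the first value plus successive differences; for a
    monotonically increasing column the deltas are small and highly
    compressible (a serial column often yields all-1 deltas)."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    # Prefix sums restore the original sequence exactly.
    return list(accumulate(deltas))
```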

So the current approach works only for blob-like types: text, json, ... But
those usually have a quite complex internal structure, and for them
generic compression algorithms tend to be more efficient than any
hand-written type-specific implementation. Also, algorithms like zstd are
able to efficiently recognize and compress many common data
distributions, like monotonic sequences, duplicates, repeated series, ...
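
This is easy to demonstrate even with stdlib zlib standing in for zstd:
a general-purpose compressor already collapses the common distributions
without any type-specific knowledge (the sample streams are synthetic):

```python
import zlib

# Three synthetic value streams a general-purpose compressor sees often.
repeated = b"status=active;" * 500            # repeated series
duplicates = bytes([7] * 4096)                # a run of duplicates
monotonic = b"".join(n.to_bytes(4, "big") for n in range(1024))

for name, raw in [("repeated", repeated), ("duplicates", duplicates),
                  ("monotonic", monotonic)]:
    packed = zlib.compress(raw, 6)
    print(name, len(raw), "->", len(packed))
```

The repeated and duplicate streams shrink by more than an order of
magnitude; even the raw monotonic counter compresses, since the
high-order bytes repeat.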

>
> 3) This patch provides custom compression methods recorded in
> the catalog.  During recovery you don't have access to the system
> catalog, because it's not recovered yet, and can't fetch compression
> method metadata from there.  A possible approach is to have a GUC
> which stores the shared module and function names for WAL compression.
> But that seems like quite a different mechanism from the one present
> in this patch.
>
I do not think that assigning the default compression method through a
GUC is such a bad idea.
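
A minimal sketch of how that could look in postgresql.conf (the GUC
names here are purely hypothetical, not taken from the patch):

```
# Load the compression module at startup, so it is available during
# recovery, before the system catalog can be read.
shared_preload_libraries = 'zstd_wal'
# Hypothetical GUC selecting the WAL compression algorithm.
wal_compression_method = 'zstd'    # falls back to 'pglz' if unset
```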

> Taking all of the above into account, I think we should give up on
> custom WAL compression methods.  Or, at least, consider them unrelated
> to this patch.
>

Sorry for repeating the same thing, but from my point of view the main
advantage of this patch is that it allows replacing pglz with more
efficient compression algorithms.
I do not see much sense in specifying a custom compression method for
particular columns.
It would be more useful, from my point of view, to include in this patch
an implementation of the compression API not only for pglz, but also for
zlib, zstd and maybe some other popular compression libraries which have
proved their efficiency.

Postgres already has a zlib dependency (unless explicitly excluded with
--without-zlib), so a zlib implementation can be included in the Postgres
build. Other implementations can be left as modules which customers can
build themselves. That is certainly less convenient than using something
pre-built, but much more convenient than making users write this code
themselves.

There is yet another aspect which is not covered by this patch:
streaming compression.
Streaming compression is needed if we want to compress libpq traffic. It
can be very efficient for the COPY command and for replication. Also,
libpq compression can improve the speed of queries returning large
results (for example, containing JSON columns) over a slow network.
I have proposed such a patch for libpq, which uses either the zlib or
the zstd streaming API. The Postgres built-in compression implementation
doesn't have a streaming API at all, so it cannot be used here. Certainly,
support for streaming may significantly complicate the compression API,
so I am not sure that it actually needs to be included in this patch.
But I would be pleased if Ildus could consider this idea.
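
For illustration, the streaming shape looks roughly like this with
Python's stdlib zlib (zstd's ZSTD_compressStream plays the same role in
C); the sync flush is what lets each protocol message be decoded as soon
as it arrives, rather than only at end of stream:

```python
import zlib

def stream_compress(chunks):
    """Compress an iterable of protocol messages incrementally.
    Z_SYNC_FLUSH makes each chunk decodable immediately on arrival,
    which matters for interactive traffic, not just bulk COPY."""
    c = zlib.compressobj(6)
    for chunk in chunks:
        yield c.compress(chunk) + c.flush(zlib.Z_SYNC_FLUSH)
    yield c.flush(zlib.Z_FINISH)

def stream_decompress(packets):
    # One decompressor instance keeps the shared history window
    # across packets; that history is what one-shot pglz lacks.
    d = zlib.decompressobj()
    for packet in packets:
        yield d.decompress(packet)
```

Because compressor and decompressor share history across messages, later
messages that resemble earlier ones compress much better than they would
if each were compressed independently.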

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
