Re: Proposal: custom compression methods

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: custom compression methods
Date: 2015-12-16 12:17:52
Message-ID: 56715670.1000304@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 12/14/2015 12:51 PM, Simon Riggs wrote:
> On 13 December 2015 at 17:28, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru <mailto:a(dot)korotkov(at)postgrespro(dot)ru>> wrote:
>
> it would be nice to make compression methods pluggable.
>
>
> Agreed.
>
> My thinking is that this should be combined with work to make use of
> the compressed data, which is why Alvaro, Tomas, David have been
> working on Col Store API for about 18 months and work on that
> continues with more submissions for 9.6 due.

I'm not sure it makes sense to combine those two uses of compression,
because there are various differences - some subtle, some less subtle.
It's a bit difficult to discuss this without any column store
background, but I'll try anyway.

The compression methods discussed in this thread, used to compress a
single varlena value, are "general-purpose" in the sense that they
operate on an opaque stream of bytes, without any additional context
(e.g. about the structure of the data being compressed). So essentially
the methods have an API like this:

int compress(const char *src, int srclen, char *dst, int dstlen);
int decompress(const char *src, int srclen, char *dst, int dstlen);

And possibly some auxiliary methods like "estimate compressed length"
and such.

OTOH the compression methods we're messing with while working on the
column store are quite different - they operate on columns (i.e. "arrays
of Datums"). Also, column stores prefer "light-weight" compression
methods like RLE or DICT (dictionary compression), because those methods
allow executing directly on the compressed data when done properly. That
however requires additional info about the data type in the column, for
example so that the RLE groups match the data type length.

So the API of those methods looks quite different, compared to the
general-purpose methods. Not only will the compression/decompression
methods have additional parameters with info about the data type, but
there will also be methods for iterating over values in the compressed
data etc.

Of course, it'd be nice to have the ability to add/remove even those
light-weight methods, but I'm not sure it makes sense to squash them
into the same catalog. I can imagine a catalog suitable for both APIs
(essentially having two groups of columns, one for each type of
compression algorithm), but I can't really imagine a compression method
providing both interfaces at the same time.

In any case, I don't think this is the main challenge the patch needs to
solve at this point.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
