Re: Proposal: custom compression methods

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: custom compression methods
Date: 2015-12-14 06:50:57
Message-ID: CAMsr+YGiN7davH54QVyaMnpQJyuO_AkbVZe6s71U-qJmwbJt3w@mail.gmail.com
Lists: pgsql-hackers

On 14 December 2015 at 01:28, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
wrote:

> Hackers,
>
> I'd like to propose a new feature: "Custom compression methods".
>

Are you aware of the past work in this area? There's quite a bit of history
and I strongly advise you to read the relevant threads to make sure you
don't run into the same problems.

See:

http://www.postgresql.org/message-id/flat/20130615102028.GK19500@alap2.anarazel.de#20130615102028.GK19500@alap2.anarazel.de

for at least one of the prior attempts.

> *Motivation*
>
> Currently when datum doesn't fit the page PostgreSQL tries to compress it
> using PGLZ algorithm. Compression of particular attributes could be turned
> on/off by tuning storage parameter of column. Also, there is heuristics
> that datum is not compressible when its first KB is not compressible. I can
> see following reasons for improving this situation.
>

Yeah, recent discussion has made it clear that there's room for improving
how and when TOAST compresses things. Per-attribute compression thresholds
would make a lot of sense.

> Therefore, it would be nice to make compression methods pluggable.
>

Very important issues to consider here are on-disk format stability, space
overhead, and pg_upgrade-ability. It looks like you have addressed all of
these issues below by making compression methods per-column, not per-Datum,
and forcing a full table rewrite to change the method.

The issue with per-Datum is that TOAST claims two bits of the varlena header,
which already limits us to 1 GiB varlena values, something people are
starting to find to be a problem. There's no wiggle room to steal more
bits. If you want pluggable compression, you need either a way to store
knowledge of how a given datum is compressed with the datum itself, or a
fast, efficient way to look it up.
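
To make the arithmetic concrete, here is a rough sketch of a 4-byte length
word with two flag bits. It is a simplified model, not PostgreSQL's actual
varlena macros: the sketch_* names are made up, and putting the flags in the
high bits is just an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of a 4-byte varlena header (hypothetical names). */
#define SKETCH_FLAG_BITS 2u
#define SKETCH_LEN_BITS  (32u - SKETCH_FLAG_BITS)
#define SKETCH_LEN_MASK  ((UINT32_C(1) << SKETCH_LEN_BITS) - 1u)
#define SKETCH_MAX_LEN   SKETCH_LEN_MASK /* 2^30 - 1 bytes, just under 1 GiB */

/* Pack two flag bits and a 30-bit length into one header word. */
static inline uint32_t sketch_make_header(uint32_t flags, uint32_t len)
{
    assert(len <= SKETCH_MAX_LEN);
    return ((flags & 3u) << SKETCH_LEN_BITS) | (len & SKETCH_LEN_MASK);
}

/* Extract the length stored in the low 30 bits. */
static inline uint32_t sketch_header_len(uint32_t header)
{
    return header & SKETCH_LEN_MASK;
}

/* Extract the two flag bits stored in the high bits. */
static inline uint32_t sketch_header_flags(uint32_t header)
{
    return header >> SKETCH_LEN_BITS;
}

/* Largest length that fits if flag_bits bits (1..31) are reserved for flags. */
static inline uint32_t sketch_max_len_for_flag_bits(unsigned flag_bits)
{
    assert(flag_bits >= 1u && flag_bits < 32u);
    return (UINT32_C(1) << (32u - flag_bits)) - 1u;
}
```

With two flag bits the length field tops out at 2^30 - 1 bytes. Stealing a
third bit for an algorithm tag would halve that again, which is why there is
no room left in the header.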

pg_upgrade means you can't just redefine the current TOAST bits so that the
compressed bit means "data is compressed, check first byte of varlena data
for algorithm", because existing data won't have that tag byte: its first
byte will be the start of the compressed data stream.
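
A toy illustration of that ambiguity, assuming a hypothetical reader that
treats the first payload byte as an algorithm tag. The tag values, the
sketch_* names and the sample bytes are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical algorithm tags; not anything PostgreSQL defines. */
enum sketch_algo {
    SKETCH_ALGO_UNKNOWN = -1,
    SKETCH_ALGO_PGLZ = 0,
    SKETCH_ALGO_OTHER = 1
};

/* A "new format" reader that trusts payload[0] to be an algorithm tag. */
static int sketch_read_algo_tag(const uint8_t *payload, size_t len)
{
    assert(payload != NULL);
    if (len == 0)
        return SKETCH_ALGO_UNKNOWN;
    switch (payload[0]) {
    case 0:
        return SKETCH_ALGO_PGLZ;
    case 1:
        return SKETCH_ALGO_OTHER;
    default:
        return SKETCH_ALGO_UNKNOWN;
    }
}

/* Datum written before any tag existed: byte 0 is compressed data. */
static const uint8_t sketch_legacy_payload[] = { 0x01, 0xAB, 0xCD };

/* Datum written in the hypothetical tagged format: byte 0 is the tag. */
static const uint8_t sketch_tagged_payload[] = { 0x00, 0xAB, 0xCD };
```

Passing sketch_legacy_payload to the reader yields SKETCH_ALGO_OTHER even
though the datum was never tagged. Nothing in the bytes themselves tells the
reader which of the two layouts it is looking at.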

There's also the issue of what you do when the algorithm used for a datum
is no longer loaded. I don't care so much about that one; I'm happy to say
"you ERROR and tell the user to fix the situation". But I think some people
were concerned about that too, or about being stuck with algorithms forever
once they're added.

Looks like you've dealt with all those concerns.

> DROP COMPRESSION METHOD compname;
>

When you drop a compression method, what happens to data compressed with
that method?

If you re-create it, can the data be associated with the re-created method?

> Compression method of column would be stored in pg_attribute table.
>

So you can't change it without a full table rewrite, but in exchange you
don't have to poach any TOAST header bits to determine which algorithm is
used. And you can use pg_depend to prevent dropping a compression method
that is still in use by a table. Makes sense.
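
Roughly what I mean by the dependency check, as a toy model only. The real
pg_depend is a catalog of dependency records; here the columns are scanned
directly, and the struct, the OID values and the sketch_* names are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int sketch_oid;

/* Toy stand-in for a pg_attribute row: the method is stored per column. */
struct sketch_column {
    const char *name;
    sketch_oid  compression_method; /* 0 means "default", by assumption */
};

/* True if any column still references the given compression method. */
static bool sketch_method_in_use(const struct sketch_column *cols,
                                 size_t ncols, sketch_oid method)
{
    assert(cols != NULL || ncols == 0);
    for (size_t i = 0; i < ncols; i++) {
        if (cols[i].compression_method == method)
            return true;
    }
    return false;
}

/* DROP COMPRESSION METHOD would be refused while a dependency exists. */
static bool sketch_can_drop_method(const struct sketch_column *cols,
                                   size_t ncols, sketch_oid method)
{
    return !sketch_method_in_use(cols, ncols, method);
}

/* Example catalog: "body" uses method 100, "title" uses the default. */
static const struct sketch_column sketch_cols[] = {
    { "body", 100 },
    { "title", 0 },
};
```

Here dropping method 100 would be refused because "body" still uses it,
while an unreferenced method such as 200 could be dropped. Reading a datum
never needs a header bit, since the column's entry already says which
method applies.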

Looks promising, but I haven't re-read the old thread in detail to see if
this approach was already considered and rejected.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
