Proposal: custom compression methods

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposal: custom compression methods
Date: 2015-12-13 17:28:08
Message-ID: CAPpHfdsdTA5uZeq6MNXL5ZRuNx+Sig4ykWzWEAfkC6ZKMDy6=Q@mail.gmail.com
Lists: pgsql-hackers

Hackers,

I'd like to propose a new feature: "Custom compression methods".

*Motivation*

Currently, when a datum doesn't fit on the page, PostgreSQL tries to compress
it using the PGLZ algorithm. Compression of particular attributes can be
turned on or off by changing the storage parameter of the column. There is
also a heuristic that treats a datum as incompressible when its first KB is
incompressible. I see the following reasons for improving this situation.

* The heuristic used to detect compressible data may not be optimal.
  We have already run into this with jsonb.
* For some data distributions there could be more effective compression
  methods than PGLZ. For example:
    * For natural-language text we could use predefined dictionaries, which
      would allow us to compress even relatively short strings (which are
      not long enough for PGLZ to train its dictionary).
    * For jsonb/hstore we could implement compression methods which keep a
      dictionary of keys. This could be either a static predefined
      dictionary or a dynamically appended dictionary with some storage.
    * For jsonb and other container types we could implement compression
      methods which allow extraction of particular fields without
      decompressing the full value.

Therefore, it would be nice to make compression methods pluggable.

*Design*

Compression methods would be stored in the pg_compress system catalog, with
the following structure:

compname name
comptype oid
compcompfunc regproc
compdecompfunc regproc

Compression methods would be created with the "CREATE COMPRESSION METHOD"
command and deleted with the "DROP COMPRESSION METHOD" command.

CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
WITH COMPRESS FUNCTION compcompfunc_name
DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;

The signatures of compcompfunc and compdecompfunc would be similar to those
of pglz_compress and pglz_decompress, except for the compression strategy
argument. Only one compression strategy is in use for pglz
(PGLZ_strategy_default), so I'm not sure it would be useful to provide
multiple strategies for compression methods.

extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest,
int32 rawsize);
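
As a minimal sketch of what a function pair with these signatures could look
like, here is a toy run-length codec. It is not part of the proposal: the
names rle_compress and rle_decompress are hypothetical, int32_t stands in for
PostgreSQL's int32, and the -1 return on failure and the requirement that
dest hold at least 2 * slen bytes are assumptions made for this example.

```c
#include <stdint.h>

/*
 * Toy run-length encoder (hypothetical, for illustration only).
 * Output is a sequence of (count, byte) pairs with count in 1..255.
 * Assumption: dest has room for 2 * slen bytes.
 * Returns the compressed length, or -1 if nothing was saved.
 */
int32_t
rle_compress(const char *source, int32_t slen, char *dest)
{
    int32_t out = 0;
    int32_t i = 0;

    while (i < slen)
    {
        char    c = source[i];
        int32_t run = 1;

        /* Extend the run while the byte repeats, up to 255. */
        while (i + run < slen && source[i + run] == c && run < 255)
            run++;

        dest[out++] = (char) run;
        dest[out++] = c;
        i += run;
    }

    /* Signal failure so the caller can keep the datum uncompressed. */
    return (out < slen) ? out : -1;
}

/*
 * Matching decoder. Assumption: dest has room for rawsize bytes.
 * Returns rawsize on success, or -1 if the input is malformed.
 */
int32_t
rle_decompress(const char *source, int32_t slen, char *dest, int32_t rawsize)
{
    int32_t out = 0;
    int32_t i;

    if (slen % 2 != 0)
        return -1;

    for (i = 0; i + 1 < slen; i += 2)
    {
        int32_t run = (unsigned char) source[i];
        int32_t j;

        if (run == 0 || out + run > rawsize)
            return -1;
        for (j = 0; j < run; j++)
            dest[out++] = source[i + 1];
    }

    return (out == rawsize) ? out : -1;
}
```

Note that the decompress function needs rawsize from the caller, which is
why the proposed signature carries it: the compressed bytes alone do not
have to record the original length.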

A compression method could be type-agnostic (comptype = 0) or type-specific
(comptype != 0). The default compression method is PGLZ.
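
The comptype rule can be sketched as a simple check. Everything below is an
assumption for illustration: the struct is a plain in-memory stand-in for a
pg_compress row (not a real catalog definition), Oid is redefined locally so
the example is self-contained, and compression_method_applies is a
hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;               /* local stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)

/* Hypothetical in-memory image of one pg_compress row. */
typedef struct CompressionMethod
{
    const char *compname;
    Oid         comptype;           /* 0 means type-agnostic */
    Oid         compcompfunc;       /* oid of the compress function */
    Oid         compdecompfunc;     /* oid of the decompress function */
} CompressionMethod;

/*
 * A method can be used for a column if it is type-agnostic
 * or if it is bound to exactly the column's type.
 */
bool
compression_method_applies(const CompressionMethod *cm, Oid coltype)
{
    return cm->comptype == InvalidOid || cm->comptype == coltype;
}
```

A check of this kind would presumably run when a compression method is
assigned to a column, so that a type-specific method is rejected for columns
of any other type.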

The compression method of a column would be stored in the pg_attribute table.
Dependencies between columns and compression methods would be tracked in
pg_depend, preventing a compression method that is currently in use from
being dropped. The compression method of an attribute could be changed with
the ALTER TABLE command.

ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
compname;

Since mixing different compression methods within the same attribute would be
hard to manage (especially dependency tracking), altering an attribute's
compression method would require a table rewrite.

*Implementation details*

Catalog changes, new commands, dependency tracking, etc. are mostly
mechanical work with no fundamental problems. The hardest part seems to be
providing seamless integration of custom compression methods into the
existing code.

It doesn't seem hard to add an extra parameter with the compression method to
toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom
decompress function knowing only the datum. That means we must somehow embed
knowledge of the compression method in the datum itself. One solution could
be to put the compression method oid right after the varlena header. Putting
this on disk would cause storage overhead and break backward compatibility.
Instead, we could add this oid right after reading the datum from the page.
This could be the weakest point in the whole proposal, and I'd be very glad
to hear better ideas.
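
The idea of tagging a datum with its method oid after it is read from the
page can be sketched as follows. This is only an illustration under stated
assumptions: the varlena header is left out, the in-memory layout is simply
a 4-byte oid followed by the payload, and tag_datum, detoast_tagged, the
method table, and the oid 1001 are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Oid;               /* local stand-in for PostgreSQL's Oid */

typedef int32_t (*decompress_fn) (const char *source, int32_t slen,
                                  char *dest, int32_t rawsize);

/* Placeholder "method" for the demo: the payload is stored unchanged. */
static int32_t
copy_decompress(const char *source, int32_t slen, char *dest, int32_t rawsize)
{
    if (slen != rawsize)
        return -1;
    memcpy(dest, source, (size_t) slen);
    return slen;
}

/* Hypothetical lookup table standing in for a catalog lookup by oid. */
static const struct
{
    Oid           cmoid;
    decompress_fn decompress;
} methods[] = {
    {1001, copy_decompress},
};

/*
 * Done once, right after reading the datum from the page, where the
 * column (and hence its compression method) is still known.
 * Writes [oid][payload] into out and returns the total length.
 */
size_t
tag_datum(Oid cmoid, const char *payload, int32_t plen, char *out)
{
    memcpy(out, &cmoid, sizeof(Oid));
    memcpy(out + sizeof(Oid), payload, (size_t) plen);
    return sizeof(Oid) + (size_t) plen;
}

/*
 * Detoast-style entry point: it sees only the tagged datum, reads the
 * oid back, and dispatches to the matching decompress function.
 * Returns the decompressed length, or -1 for an unknown method.
 */
int32_t
detoast_tagged(const char *tagged, int32_t tlen, char *dest, int32_t rawsize)
{
    Oid    cmoid;
    size_t i;

    if (tlen < (int32_t) sizeof(Oid))
        return -1;
    memcpy(&cmoid, tagged, sizeof(Oid));

    for (i = 0; i < sizeof(methods) / sizeof(methods[0]); i++)
    {
        if (methods[i].cmoid == cmoid)
            return methods[i].decompress(tagged + sizeof(Oid),
                                         tlen - (int32_t) sizeof(Oid),
                                         dest, rawsize);
    }
    return -1;
}
```

The sketch also makes the cost visible: every compressed datum is copied
once more in memory to make room for the oid, which is part of why this
step may be the weakest point of the design.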

P.S. I'd like to thank Petr Korobeinikov <pkorobeinikov(at)gmail(dot)com>, who
started work on this patch and sent me a draft of the proposal in Russian.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
