|From:||Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>|
|Subject:||Proposal: custom compression methods|
I'd like to propose a new feature: "Custom compression methods".
Currently, when a datum doesn't fit on the page, PostgreSQL tries to compress it
using the PGLZ algorithm. Compression of particular attributes can be turned
on or off by setting the column's storage parameter. There is also a heuristic
that treats a datum as incompressible when its first KB is not compressible. I
see the following reasons for improving this situation.
* The heuristics used to detect compressible data may not be optimal.
We have already run into this with jsonb.
* For some data distributions there could be more effective compression
methods than PGLZ. For example:
  * For natural languages we could use predefined dictionaries, which
  would allow us to compress even relatively short strings (which are not
  long enough for PGLZ to train its dictionary).
  * For jsonb/hstore we could implement compression methods which keep a
  dictionary of keys. This could be either a static predefined dictionary or
  a dynamically appended dictionary with some storage.
  * For jsonb and other container types we could implement compression
  methods which would allow extraction of particular fields without
  decompressing the full value.
Therefore, it would be nice to make compression methods pluggable.
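
To make the "dictionary of keys" idea for jsonb/hstore a bit more concrete, here is a minimal sketch of a static predefined dictionary that maps frequent keys to small integer codes. The dictionary contents and the names key_dict, key_to_code and code_to_key are hypothetical and only serve as illustration; nothing here is taken from an actual patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical static dictionary of frequent keys (illustration only). */
static const char *const key_dict[] = {"id", "name", "email", "created_at"};

#define KEY_DICT_SIZE (sizeof(key_dict) / sizeof(key_dict[0]))

/* Return the small integer code for a key, or -1 if it is not in the dictionary. */
int
key_to_code(const char *key)
{
    size_t      i;

    for (i = 0; i < KEY_DICT_SIZE; i++)
    {
        if (strcmp(key_dict[i], key) == 0)
            return (int) i;
    }
    return -1;
}

/* Return the key for a code, or NULL if the code is out of range. */
const char *
code_to_key(int code)
{
    if (code < 0 || (size_t) code >= KEY_DICT_SIZE)
        return NULL;
    return key_dict[code];
}
```

A compression method built on this would store the code instead of the key text and fall back to the literal key when key_to_code returns -1. A dynamically appended dictionary would additionally need some storage for the key list, as mentioned above.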
Compression methods would be stored in pg_compress system catalog table of
Compression methods could be created with the "CREATE COMPRESSION METHOD" command
and dropped with the "DROP COMPRESSION METHOD" command.
CREATE COMPRESSION METHOD compname [FOR TYPE comptype_name]
WITH COMPRESS FUNCTION compcompfunc_name
DECOMPRESS FUNCTION compdecompfunc_name;
DROP COMPRESSION METHOD compname;
The signatures of compcompfunc and compdecompfunc would be similar to those of
pglz_compress and pglz_decompress, except for the compression strategy. There is
only one compression strategy in use for pglz (PGLZ_strategy_default).
Thus, I'm not sure it would be useful to provide multiple strategies for
extern int32 compcompfunc(const char *source, int32 slen, char *dest);
extern int32 compdecompfunc(const char *source, int32 slen, char *dest,
A compression method could be type-agnostic (comptype = 0) or type-specific
(comptype != 0). The default compression method is PGLZ.
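
As a rough illustration of what a pluggable method might look like, the sketch below pairs a trivial run-length encoder with a descriptor that carries comptype. It assumes the decompress function takes the raw size as its last parameter, as pglz_decompress does, and that the compress function returns -1 when it cannot shrink the input. The CompressionMethod struct, the rle_* functions and method_applies_to are hypothetical names, and the int32/Oid typedefs are stand-ins for the PostgreSQL ones.

```c
#include <stdint.h>
#include <string.h>

typedef int32_t int32;          /* stand-in for PostgreSQL's int32 */
typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */

typedef int32 (*compress_fn) (const char *source, int32 slen, char *dest);
typedef int32 (*decompress_fn) (const char *source, int32 slen, char *dest,
                                int32 rawsize);

/* Hypothetical in-memory descriptor of one compression method. */
typedef struct CompressionMethod
{
    const char *compname;
    Oid         comptype;       /* 0 means type-agnostic */
    compress_fn compcompfunc;
    decompress_fn compdecompfunc;
} CompressionMethod;

/*
 * Run-length encode source into (count, byte) pairs. dest must have room
 * for slen bytes. Returns the compressed length, or -1 if the result
 * would not be smaller than the input.
 */
int32
rle_compress(const char *source, int32 slen, char *dest)
{
    int32       in = 0;
    int32       out = 0;

    if (slen <= 0)
        return -1;

    while (in < slen)
    {
        char        byte = source[in];
        int32       run = 1;

        while (in + run < slen && run < 255 && source[in + run] == byte)
            run++;

        if (out + 2 >= slen)
            return -1;          /* no gain, give up */

        dest[out++] = (char) (unsigned char) run;
        dest[out++] = byte;
        in += run;
    }
    return out;
}

/* Expand (count, byte) pairs. Returns rawsize on success, -1 on corrupt input. */
int32
rle_decompress(const char *source, int32 slen, char *dest, int32 rawsize)
{
    int32       in = 0;
    int32       out = 0;

    while (in + 1 < slen)
    {
        int32       run = (unsigned char) source[in];

        if (run == 0 || out + run > rawsize)
            return -1;
        memset(dest + out, source[in + 1], (size_t) run);
        out += run;
        in += 2;
    }
    return (in == slen && out == rawsize) ? out : -1;
}

/* A method applies if it is type-agnostic or bound to exactly this type. */
int
method_applies_to(const CompressionMethod *cm, Oid typid)
{
    return cm->comptype == 0 || cm->comptype == typid;
}
```

Run-length encoding is used here only because it is short; it is not being proposed as a useful method for real data.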
The compression method of a column would be stored in the pg_attribute table.
Dependencies between columns and compression methods would be tracked in
pg_depend, preventing a compression method that is currently in use from being dropped.
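
The dependency rule above boils down to a simple check, sketched here with a flat array standing in for pg_depend. The ColumnDependency struct and both function names are hypothetical, and the real catalog lookup would of course go through the existing dependency machinery rather than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */

/* Hypothetical, simplified stand-in for a column -> method dependency row. */
typedef struct ColumnDependency
{
    Oid         relid;          /* table owning the column */
    int         attnum;         /* column number within the table */
    Oid         compmethod;     /* compression method the column uses */
} ColumnDependency;

/* Count the columns that still reference the given compression method. */
size_t
count_method_users(const ColumnDependency *deps, size_t ndeps, Oid method)
{
    size_t      i;
    size_t      users = 0;

    for (i = 0; i < ndeps; i++)
    {
        if (deps[i].compmethod == method)
            users++;
    }
    return users;
}

/* A method may be dropped only when no column depends on it. */
int
can_drop_method(const ColumnDependency *deps, size_t ndeps, Oid method)
{
    return count_method_users(deps, ndeps, method) == 0;
}
```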
The compression method of an attribute could be altered with the ALTER TABLE command.
ALTER TABLE table_name ALTER COLUMN column_name SET COMPRESSION METHOD
Since mixing different compression methods in the same attribute would be
hard to manage (especially dependency tracking), altering an attribute's
compression method would require a table rewrite.
Catalog changes, new commands, dependency tracking, etc. are mostly
mechanical stuff with no fundamental problems. The hardest part seems to be
providing seamless integration of custom compression methods into existing
It doesn't seem hard to add an extra parameter with the compression method to
toast_compress_datum. However, PG_DETOAST_DATUM has to call the custom
decompress function knowing only the datum. That means we would have to
somehow embed knowledge of the compression method in the datum. One solution
could be putting the compression method oid right after the varlena header. Putting
this on disk would cause storage overhead and break backward compatibility.
Thus, we could add this oid right after reading the datum from the page. This
could be the weakest point in the whole proposal and I'll be very glad for
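
The idea of attaching the method oid only in memory can be sketched as follows. This deliberately uses a simplified layout, a 4-byte total-length header followed by the payload, and is not the real varlena format; HDRSZ, attach_method_oid and get_method_oid are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t Oid;           /* stand-in for PostgreSQL's Oid */

#define HDRSZ 4                 /* simplified header: total length in bytes */

/*
 * Build the in-memory form of a datum read from the page:
 *   on page:   [header][payload]
 *   in memory: [header][method oid][payload]
 * inmem must have room for the on-page length plus sizeof(Oid).
 * Returns the new total length.
 */
size_t
attach_method_oid(const char *ondisk, Oid method, char *inmem)
{
    uint32_t    total;
    uint32_t    newtotal;

    memcpy(&total, ondisk, HDRSZ);
    newtotal = total + (uint32_t) sizeof(Oid);

    memcpy(inmem, &newtotal, HDRSZ);
    memcpy(inmem + HDRSZ, &method, sizeof(Oid));
    memcpy(inmem + HDRSZ + sizeof(Oid), ondisk + HDRSZ, total - HDRSZ);
    return newtotal;
}

/* Read the method oid back, knowing nothing but the in-memory datum. */
Oid
get_method_oid(const char *inmem)
{
    Oid         method;

    memcpy(&method, inmem + HDRSZ, sizeof(Oid));
    return method;
}
```

The on-page bytes stay unchanged, so there is no storage overhead, but every read pays for an extra copy, and any code path that hands a datum to the detoast routine without going through this step would not find the oid.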
P.S. I'd like to thank Petr Korobeinikov <pkorobeinikov(at)gmail(dot)com> who
started work on this patch and sent me a draft of the proposal in Russian.
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company