Re: [HACKERS] Custom compression methods

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org, Ildar Musin <i(dot)musin(at)postgrespro(dot)ru>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-30 19:47:13
Message-ID: 9dc0d3ba-5fa2-a7b0-5789-fff3c60cef1c@2ndquadrant.com

On 11/30/2017 04:20 PM, Ildus Kurbangaliev wrote:
> On Thu, 30 Nov 2017 00:30:37 +0100
> Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> ...
>
>> I can imagine other interesting use cases - for example values in
>> JSONB columns often use the same "schema" (keys, nesting, ...), so I
>> can imagine building a "dictionary" of JSON keys for the whole
>> column ...
>>
>> Ildus, is this a use case you've been aiming for, or were you aiming
>> to use the new API in a different way?
>
> Thank you for such a good overview. I agree that pglz is pretty good
> as a general compression method, and there's no point in changing it,
> at least for now.
>
> I see a few useful use cases for compression methods: specialized
> compression for int[] and timestamp[] in time series, and yes,
> dictionaries for jsonb, for which I have already created an extension
> (https://github.com/postgrespro/jsonbd). It's working and giving
> promising results.
>

I understand the reluctance to put everything into core, particularly
for complex patches that evolve quickly. Also, not having to put
everything into core is kinda why we have extensions.

But perhaps some of the simpler cases would be good candidates for core,
making it possible to test the feature?
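
For instance, even a trivial method that just wraps pglz with
non-default parameters would be enough to exercise the DDL and storage
paths in the regression tests. A sketch (the names here are made up,
just to illustrate the idea):

    CREATE COMPRESSION METHOD pglz_fast HANDLER pglz_fast_handler;
    CREATE TABLE t (a text COMPRESSED pglz_fast);
    INSERT INTO t VALUES (repeat('x', 100000));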

>>
>> I wonder if the patch can be improved to handle this use case better.
>> For example, it requires knowledge of the actual data type, instead of
>> treating it as opaque varlena / byte array. I see tsvector compression
>> does that by checking typeid in the handler.
>>
>> But that fails, for example, with this:
>>
>> db=# create domain x as tsvector;
>> CREATE DOMAIN
>> db=# create table t (a x compressed ts1);
>> ERROR: unexpected type 28198672 for tsvector compression handler
>>
>> which means it's a few bricks shy of properly supporting domains. But
>> I wonder if this should instead be specified in CREATE COMPRESSION
>> METHOD. I mean, something like
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector;
>>
>> When the type is not specified, it applies to all varlena values;
>> otherwise only to that type. Also, why not allow setting the
>> compression method as the default for a data type, e.g.
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler
>> TYPE tsvector DEFAULT;
>>
>> would automatically add 'COMPRESSED ts1' to all tsvector columns in
>> new CREATE TABLE commands.
>
> The initial version of the patch contained ALTER syntax that changed
> the compression method for whole types, but I decided to remove that
> functionality for now, because the patch is already quite complex and
> it could be added later as a separate patch.
>
> Syntax was:
> ALTER TYPE <type> SET COMPRESSION <cm>;
>
> Specifying the supported type for the compression method is a good idea.
> Maybe the following syntax would be better?
>
> CREATE COMPRESSION METHOD ts1 FOR tsvector HANDLER
> tsvector_compression_handler;
>

Understood. Good to know you've considered it, and I agree it doesn't
need to be there from the start (which makes the patch simpler).

>>
>> BTW do you expect the tsvector compression to be generally useful, or
>> is it meant to be used only by the tests? If generally useful,
>> perhaps it should be created in pg_compression by default. If only
>> for tests, maybe it should be implemented in an extension in contrib
>> (thus also serving as an example of how to implement new methods).
>>
>> I haven't thought about the JSONB use case very much, but I suppose
>> that could be done using the configure/drop methods. I mean,
>> allocating the dictionary somewhere (e.g. in a table created by an
>> extension?). The configure method gets the Form_pg_attribute record,
>> so that should be enough I guess.
>>
>> But the patch is not testing those two methods at all, which seems
>> like something that needs to be addressed before commit. I don't
>> expect a full-fledged JSONB compression extension, but something
>> simple that actually exercises those methods in a meaningful way.
>
> I will move tsvector_compression_handler to a separate extension in
> the next version. I added it more as an example, but it could also be
> used to achieve better compression for tsvectors. Tests on a mailing
> list database ('archie' tables):
>
> usual compression:
>
> maillists=# select body_tsvector, subject_tsvector into t1 from messages;
> SELECT 1114213
> maillists=# select pg_size_pretty(pg_total_relation_size('t1'));
> pg_size_pretty
> ----------------
> 1637 MB
> (1 row)
>
> tsvector_compression_handler:
> maillists=# select pg_size_pretty(pg_total_relation_size('t2'));
> pg_size_pretty
> ----------------
> 1521 MB
> (1 row)
>
> lz4:
> maillists=# select pg_size_pretty(pg_total_relation_size('t3'));
> pg_size_pretty
> ----------------
> 1487 MB
> (1 row)
>
> I'm not attached to tsvector_compression_handler; if there is some
> example that can use all the features, then tsvector_compression_handler
> could be replaced with it.
>

OK. I think it's a nice use case (and nice gains on the compression
ratio), demonstrating the datatype-aware compression. The question is
why shouldn't this be built into the datatypes directly?

That would certainly be possible for tsvector, although it wouldn't be
as transparent (the datatype code would have to support it explicitly).

I'm a bit torn on this. The custom compression method patch makes the
compression mostly transparent for the datatype code (by adding an extra
"compression" header). But it's coupled to the datatype quite strongly
as it requires knowledge of the data type internals.

It's a bit less coupled for "generic" datatypes (e.g. arrays of other
types), where it may add important information (e.g. that the array
represents a chunk of timeseries data, which the array code can't
possibly know).
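
To make that a bit more concrete, consider something like this (the
method name and handler are made up, the column syntax is from the
patch):

    -- a hypothetical delta-based method for arrays storing timeseries
    CREATE COMPRESSION METHOD ts_delta HANDLER ts_delta_handler;

    -- the column definition carries the knowledge the generic array
    -- code can't have: these are ordered sensor readings
    CREATE TABLE metrics (chunk float8[] COMPRESSED ts_delta);

The array code sees an opaque float8[], but the compression method may
exploit the fact that consecutive elements tend to be close to each
other.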

>
> My extension for jsonb dictionaries is big enough, and I'm not ready
> to try to include it in the patch. I do see the use of the 'drop'
> method, since there should be a way for an extension to clean up its
> resources, but I don't see a simple enough usage for it in tests.
> Maybe dummy methods for 'drop' and 'configure' will be enough for
> testing purposes.
>

OK.

>>
>> Similarly for the compression options - we need to test that the WITH
>> part is handled correctly (tsvector does not provide a configure
>> method).
>
> I could add some options to tsvector_compression_handler, like options
> that change pglz_compress parameters.
>

+1 for doing that
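
Exposing the PGLZ_Strategy knobs seems like a natural fit. Perhaps
something along these lines (I'm only guessing the exact spelling of
the WITH clause, so take it as a sketch):

    CREATE TABLE t (a tsvector COMPRESSED ts1
                    WITH (min_comp_rate '25', first_success_by '2048'));

min_comp_rate and first_success_by are existing PGLZ_Strategy fields,
so that would exercise both the options parsing and the configure
method with meaningful values.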

>>
>> Which reminds me I'm confused by pg_compression_opt. Consider this:
>>
>> CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler;
>> CREATE TABLE t (a tsvector COMPRESSED ts1);
>>
>> db=# select * from pg_compression_opt ;
>> cmoptoid | cmname | cmhandler | cmoptions
>> ----------+--------+------------------------------+-----------
>> 28198689 | ts1 | tsvector_compression_handler |
>> (1 row)
>>
>> DROP TABLE t;
>>
>> db=# select * from pg_compression_opt ;
>> cmoptoid | cmname | cmhandler | cmoptions
>> ----------+--------+------------------------------+-----------
>> 28198689 | ts1 | tsvector_compression_handler |
>> (1 row)
>>
>> db=# DROP COMPRESSION METHOD ts1;
>> ERROR: cannot drop compression method ts1 because other objects
>> depend on it
>> DETAIL: compression options for ts1 depends on compression method
>> ts1
>> HINT: Use DROP ... CASCADE to drop the dependent objects too.
>>
>> I believe the pg_compression_opt is actually linked to pg_attribute,
>> in which case it should include (attrelid,attnum), and should be
>> dropped when the table is dropped.
>>
>> I suppose it was done this way to work around the lack of
>> recompression (i.e. the compressed value might have ended up in
>> another table), but that is no longer true.
>
> Good point. Since there is recompression now, the options can be
> safely removed when the table is dropped. It will complicate
> pg_upgrade, but I think this is solvable.
>

+1 to do that. I've never dealt with pg_upgrade, but I suppose this
shouldn't be more complicated than for custom data types, right?

>>
>> A few more comments:
>>
>> 1) The patch makes optionListToArray (in foreigncmds.c) non-static,
>> but it's not used anywhere. So this seems like a change that is no
>> longer necessary.
>
> I use this function in CreateCompressionOptions.
>

Ah, my mistake. I only did 'git grep', which does not search new files
(not yet added to git). It seems a bit strange to keep the function in
foreigncmds.c, though, now that we use it outside of FDWs.

>>
>> 2) I see we add two un-reserved keywords in gram.y - COMPRESSION and
>> COMPRESSED. Perhaps COMPRESSION would be enough? I mean, we could do
>>
>> CREATE TABLE t (c TEXT COMPRESSION cm1);
>> ALTER ... SET COMPRESSION name ...
>> ALTER ... SET COMPRESSION none;
>>
>> Although I agree the "SET COMPRESSION none" is a bit strange.
>
> I agree; I've already changed the syntax for the next version of the
> patch. It's COMPRESSION instead of COMPRESSED, and DROP COMPRESSION
> instead of SET NOT COMPRESSED. One keyword less in the grammar, and
> it looks nicer.
>

I'm not sure DROP COMPRESSION is a good idea. It implies that the data
will be uncompressed, but I assume it merely switches to the built-in
compression (pglz), right? Although "SET COMPRESSION none" has the same
issue ...

BTW, when you do DROP COMPRESSION (or whatever syntax we end up using),
will that remove the dependencies on the compression method? I haven't
tried, so maybe it does.
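
In other words, I'd expect a sequence like this to work (the ALTER
spelling is hypothetical, and I haven't tried any of it):

    CREATE COMPRESSION METHOD ts1 HANDLER tsvector_compression_handler;
    CREATE TABLE t (a tsvector COMPRESSED ts1);
    ALTER TABLE t ALTER COLUMN a DROP COMPRESSION;
    DROP COMPRESSION METHOD ts1;  -- should now succeed without CASCADE

and "SELECT * FROM pg_compression_opt WHERE cmname = 'ts1'" should
come back empty after the ALTER.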

>>
>> 3) heap_prepare_insert uses this chunk of code
>>
>> + else if (HeapTupleHasExternal(tup)
>> + || RelationGetDescr(relation)->tdflags &
>> TD_ATTR_CUSTOM_COMPRESSED
>> + || HeapTupleHasCustomCompressed(tup)
>> + || tup->t_len > TOAST_TUPLE_THRESHOLD)
>>
>> Shouldn't that be rather
>>
>> + else if (HeapTupleHasExternal(tup)
>> + || (RelationGetDescr(relation)->tdflags &
>> TD_ATTR_CUSTOM_COMPRESSED
>> + && HeapTupleHasCustomCompressed(tup))
>> + || tup->t_len > TOAST_TUPLE_THRESHOLD)
>
> These conditions are used for opposite directions.
> HeapTupleHasCustomCompressed(tup) is true if the tuple has compressed
> datums inside, and we need to decompress them first; the
> TD_ATTR_CUSTOM_COMPRESSED flag means that the relation we're putting
> the data into has columns with custom compression, and maybe we need
> to compress datums in the current tuple.
>

Ah, right, now it makes sense. Thanks for explaining.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
