Re: [HACKERS] Custom compression methods

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-12-13 10:10:46
Message-ID: b1e047de-67d5-2fd8-ad9e-93434497ad91@2ndquadrant.com
Lists: pgsql-hackers

On 12/13/2017 01:54 AM, Robert Haas wrote:
> On Tue, Dec 12, 2017 at 5:07 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>> I definitely think there's a place for compression built right into
>>> the data type. I'm still happy about commit
>>> 145343534c153d1e6c3cff1fa1855787684d9a38 -- although really, more
>>> needs to be done there. But that type of improvement and what is
>>> proposed here are basically orthogonal. Having either one is good;
>>> having both is better.
>>>
>> Why orthogonal?
>
> I mean, they are different things. Data types are already free to
> invent more compact representations, and that does not preclude
> applying pglz to the result.
>
>> For example, why couldn't (or shouldn't) the tsvector compression be
>> done by tsvector code itself? Why should we be doing that at the varlena
>> level (so that the tsvector code does not even know about it)?
>
> We could do that, but then:
>
> 1. The compression algorithm would be hard-coded into the system
> rather than changeable. Pluggability has some value.
>

Sure. I agree that extensibility of pretty much all parts of the system
is a significant asset of the project.

> 2. If several data types can benefit from a similar approach, it has
> to be separately implemented for each one.
>

I don't think the current solution improves on that, though. If you want
to exploit the internal structure of individual data types, it pretty
much requires code customized to each such data type.

For example, you can't take the tsvector compression and just slap it
onto tsquery, because it relies on knowledge of tsvector's internal
structure. So you need separate implementations anyway.

> 3. Compression is only applied to large-ish values. If you are just
> making the data type representation more compact, you probably want to
> apply the new representation to all values. If you are compressing in
> the sense that the original data gets smaller but harder to interpret,
> then you probably only want to apply the technique where the value is
> already pretty wide, and maybe respect the user's configured storage
> attributes. TOAST knows about some of that kind of stuff.
>

Good point. One such parameter that I really miss is a compression
level. I can imagine tuning it through CREATE COMPRESSION METHOD, but
that does not seem quite possible with compression happening inside a
datatype.
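To make that concrete, here is a hypothetical DDL sketch. The handler
name, the WITH options clause, and the "level" option are all made up
for illustration; nothing like this is committed, and the patch under
discussion may well spell it differently:

```sql
-- Hypothetical syntax, for illustration only: register a compression
-- method backed by a handler function, then attach it to a column with
-- a tunable compression level.
CREATE COMPRESSION METHOD lz4_cm HANDLER lz4_compression_handler;

ALTER TABLE docs
    ALTER COLUMN body SET COMPRESSION lz4_cm WITH (level '9');
```

The point being that a per-column options clause is a natural place for
a compression level, whereas a datatype doing its own compression has no
obvious hook for per-column tuning.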

>> It seems to me the main reason is that tsvector actually does not allow
>> us to do that, as there's no good way to distinguish the different
>> internal format (e.g. by storing a flag or format version in some sort
>> of header, etc.).
>
> That is also a potential problem, although I suspect it is possible to
> work around it somehow for most data types. It might be annoying,
> though.
>
>>> I think there may also be a place for declaring that a particular data
>>> type has a "privileged" type of TOAST compression; if you use that
>>> kind of compression for that data type, the data type will do smart
>>> things, and if not, it will have to decompress in more cases. But I
>>> think this infrastructure makes that kind of thing easier, not harder.
>>
>> I don't quite understand how that would be done. Isn't TOAST meant to be
>> entirely transparent for the datatypes? I can imagine custom TOAST
>> compression (which is pretty much what the patch does, after all), but I
>> don't see how the datatype could do anything smart about it, because it
>> has no idea which particular compression was used. And considering the
>> OIDs of the compression methods do change, I'm not sure that's fixable.
>
> I don't think TOAST needs to be entirely transparent for the
> datatypes. We've already dipped our toe in the water by allowing some
> operations on "short" varlenas, and there's really nothing to prevent
> a given datatype from going further. The OID problem you mentioned
> would presumably be solved by hard-coding the OIDs for any built-in,
> privileged compression methods.
>

Stupid question, but what do you mean by "short" varlenas?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
