Re: [HACKERS] Custom compression methods

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2021-03-18 11:10:41
Message-ID: CAFiTN-sBH65e4B5kQB5MT8BQOPajkrKvSdKA-07+ra3f2x4vCA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 18, 2021 at 1:32 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> ).On Mon, Mar 15, 2021 at 6:58 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > - Adding all these indirect function calls via toast_compression[] just
> > for all of two builtin methods isn't fun either.
>
> Yeah, it feels like this has too many layers of indirection now. Like,
> toast_decompress_datum() first gets TOAST_COMPRESS_METHOD(attr). Then
> it calls CompressionIdToMethod to convert one constant (like
> TOAST_PGLZ_COMPRESSION_ID) to another constant with a slightly
> different name (like TOAST_PGLZ_COMPRESSION). Then it calls
> GetCompressionRoutines() to get hold of the function pointers. Then it
> does an indirect functional call. That seemed like a pretty reasonable
> idea when we were trying to support arbitrary compression AMs without
> overly privileging the stuff that was built into core, but if we're
> just doing stuff that's built into core, then we could just switch
> (TOAST_COMPRESS_METHOD(attr)) and call the correct function. In fact,
> we could even move the stuff from toast_compression.c into detoast.c,
> which would allow the compiler to optimize better (e.g. by inlining,
> if it wants).
>
> The same applies to toast_decompress_datum_slice().

Changed this, but I have still kept the functions in
toast_compression.c. I think keeping compression related
functionality in a separate file looks much cleaner. Please have a
look and let me know that whether you still feel we should move it ti
detoast.c. If the reason is that we can inline, then I feel we are
already paying cost of compression/decompression and compare to that
in lining a function will not make much difference.

> There's a similar issue in toast_get_compression_method() and the only
> caller, pg_column_compression(). Here the multiple mapping layers and
> the indirect function call are split across those two functions rather
> than all in the same one, but here again one could presumably find a
> place to just switch on TOAST_COMPRESS_METHOD(attr) or
> VARATT_EXTERNAL_GET_COMPRESSION(attr) and return "pglz" or "lz4"
> directly.

I have simplified that, only one level of function call from
pg_column_compression, I have kept a toast_get_compression_id
function because in later patch 0005, we will be using that for
getting the compression id from the compressed data.

> In toast_compress_datum(), I think we could have a switch that invokes
> the appropriate compressor based on cmethod and sets a variable to the
> value to be passed as the final argument of
> TOAST_COMPRESS_SET_SIZE_AND_METHOD().

Done

> Likewise, I suppose CompressionNameToMethod could at least be
> simplified to use constant strings rather than stuff like
> toast_compression[TOAST_PGLZ_COMPRESSION_ID].cmname.

Done

Other changes:
- As suggested by Andres, remove compression method comparision from
eualTupleDesc, because it is not required now.
- I found one problem in existing patch, the problem was in
detoast_attr_slice, if externally stored data is compressed then we
compute max possible compressed size to fetch based on the slice
length, for that we were using pglz_maximum_compressed_size, which is
not correct for lz4. For lz4, I think we need to fetch the complete
compressed data. We might think that for lz4 we might compute lie
Min(LZ4_compressBound(slicelength, total_compressed_size); But IMHO,
we can not do that and the reason is same that why we should not use
PGLZ_MAX_OUTPUT for pglz (explained in the comment atop
pglz_maximum_compressed_size).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v37-0001-Get-datum-from-tuple-which-doesn-t-contain-exter.patch text/x-patch 37.4 KB
v37-0003-Built-in-compression-method.patch text/x-patch 104.5 KB
v37-0004-Add-default_toast_compression-GUC.patch text/x-patch 11.8 KB
v37-0005-Alter-table-set-compression.patch text/x-patch 21.5 KB
v37-0002-Expand-the-external-data-before-forming-the-tupl.patch text/x-patch 8.2 KB
v37-0006-default-to-with-lz4.patch text/x-patch 1.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2021-03-18 11:11:41 Re: Parallel INSERT (INTO ... SELECT ...)
Previous Message Thomas Munro 2021-03-18 11:05:11 Re: fdatasync performance problem with large number of DB files