Re: [HACKERS] Custom compression methods

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2021-03-18 19:57:21
Message-ID: CA+TgmoYRsapWz2vDSQXFAEL_BAMq1aY37avcq9GAB=SUUvFhaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 18, 2021 at 10:22 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> I just realized that in the last patch (0003) I forgot to remove 2
> unused functions, CompressionMethodToId and CompressionIdToMethod.
> Removed in the latest patch.

I spent a little time polishing 0001 and here's what I came up with. I
adjusted some comments, added documentation, fixed up the commit
message, etc.

I still don't quite like the approach in 0002. I feel that the
function should not construct the tuple but modify the caller's arrays
as a side effect. And if we're absolutely committed to the design
where it does that, the comments need to call it out clearly, which
they don't.

Regarding 0003:

I think it might make sense to change the names of the compression and
decompression functions to match the names of the callers more
closely. Like, toast_decompress_datum() calls either
pglz_cmdecompress() or lz4_cmdecompress(). But, why not
pglz_decompress_datum() or lz4_decompress_datum()? The "cm" thing
doesn't really mean anything, and because the varlena is allocated by
that function itself rather than the caller, this can't be used for
anything other than TOAST.

In toast_compress_datum(), if (tmp == NULL) return
PointerGetDatum(NULL) is duplicated. It would be better to move it
after the switch.

Instead of "could not compress data with lz4" I suggest "lz4
compression failed".

In catalogs.sgml, you shouldn't mention InvalidCompressionMethod, but
you should explain what the actual possible values mean. Look at the
way attidentity and attgenerated are documented and do it like that.

In pg_column_compression() it might be a bit more elegant to add a
char *result variable or similar, and have the switch cases just set
it, and then do PG_RETURN_TEXT_P(cstring_to_text(result)) at the
bottom.

In getTableAttrs(), if the remoteVersion is new, the column gets a
different alias than if the column is old.

In dumpTableSchema(), the condition tbinfo->attcompression[j] means
exactly the thing as the condition tbinfo->attcompression[j] != '\0',
so it can't be right to test both. I think that there's some confusion
here about the data type of tbinfo->attcompression[j]. It seems to be
char *. Maybe you intended to test the first character in that second
test, but that's not what this does. But you don't need to test that
anyway because the switch already takes care of it. So I suggest (a)
removing tbinfo->attcompression[j] != '\0' from this if-statement and
(b) adding != NULL to the previous line for clarity. I would also
suggest concluding the switch with a break just for symmetry.

The patch removes 11 references to va_extsize and leaves behind 4.
None of those 4 look like things that should have been left.

The comment which says "When fetching a prefix of a compressed
external datum, account for the rawsize tracking amount of raw data,
which is stored at the beginning as an int32 value)" is no longer 100%
accurate. I suggest changing it to say something like "When fetching a
prefix of a compressed external datum, account for the space required
by va_tcinfo" and leave out the rest.

In describeOneTableDetails, the comment "compresssion info" needs to
be compressed by removing one "s".

It seems a little unfortunate that we need to include
access/toast_compression.h in detoast.h. It seems like the reason we
need to do that is because otherwise we won't have ToastCompressionId
defined and so we won't be able to prototype toast_get_compression_id.
But I think we should solve that problem by moving that file to
toast_compression.c. (I'm OK if you want to keep the files separate,
or if you want to reverse course and combine them I'm OK with that
too, but the extra header dependency is clearly a sign of a problem
with the split.)

Regarding 0005:

I think ApplyChangesToIndexes() should be renamed to something like
SetIndexStorageProperties(). It's too generic right now.

I think 0004 and 0005 should just be merged into 0003. I can't see
committing them separately. I know I was the one who made you split
the patch up in the first place, but those patches are quite small and
simple now, so it makes more sense to me to combine them.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v39-0001-Invent-HeapTupleGetRawDatum-and-friends.patch application/octet-stream 40.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Euler Taveira 2021-03-18 20:06:23 Re: cleanup temporary files after crash
Previous Message Daniil Zakhlystov 2021-03-18 19:30:09 Re: libpq compression