Re: [HACKERS] Custom compression methods

From: Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Custom compression methods
Date: 2017-11-21 14:47:17
Message-ID: 20171121174717.69ecd8f4@wp.localdomain
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Mon, 20 Nov 2017 00:04:53 +0100
Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> I did a review of this today, and I think there are some things that
> need improvement / fixing.
>
> Firstly, some basic comments from just eye-balling the diff, then some
> bugs I discovered after writing an extension adding lz4.
>
> 1) formatRelOptions/freeRelOptions are no longer needed (I see Ildar
> already pointer that out)

I removed freeRelOptions, but formatRelOptions is used in other place.

>
> 2) There's unnecessary whitespace (extra newlines) on a couple of
> places, which is needlessly increasing the size of the patch. Small
> difference, but annoying.

Cleaned up.

>
> 3) tuptoaster.c
>
> Why do you change 'info' from int32 to uint32? Seems unnecessary.

That's because I use highest bit, and it makes number negative for
int32. I use right shifting to get that bit and right shift on negative
gives negative value too.

>
> Adding new 'att' variable in toast_insert_or_update is confusing, as
> there already is 'att' in the very next loop. Technically it's
> correct, but I'd bet it'll lead to some WTF?! moments later. I
> propose to just use TupleDescAttr(tupleDesc,i) on the two places
> where it matters, around line 808.
>
> There are no comments for init_compression_options_htab and
> get_compression_options_info, so that needs to be fixed. Moreover, the
> names are confusing because what we really get is not just 'options'
> but the compression routines too.

Removed extra 'att', and added comments.

>
> 4) gen_db_file_maps probably shouldn't do the fprints, right?
>
> 5) not sure why you modify src/tools/pgindent/exclude_file_patterns

My bad, removed these lines.

>
> 6) I'm rather confused by AttributeCompression vs. ColumnCompression.
> I mean, attribute==column, right? Of course, one is for data from
> parser, the other one is for internal info. But can we make the
> naming clearer?

For now I have renamed AttributeCompression to CompressionOptions, not
sure that's a good name but at least it gives less confusion.

>
> 7) The docs in general are somewhat unsatisfactory, TBH. For example
> the ColumnCompression has no comments, unlike everything else in
> parsenodes. Similarly for the SGML docs - I suggest to expand them to
> resemble FDW docs
> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
> also follows the handler/routines pattern.

I've added more comments. I think I'll add more documentation if the
committers will approve current syntax.

>
> 8) One of the unclear things if why we even need 'drop' routing. It
> seems that if it's defined DropAttributeCompression does something.
> But what should it do? I suppose dropping the options should be done
> using dependencies (just like we drop columns in this case).
>
> BTW why does DropAttributeCompression mess with att->attisdropped in
> this way? That seems a bit odd.

'drop' routine could be useful. An extension could do
something related with the attribute, like remove extra tables or
something else. The compression options will not be removed after
unlinking compression method from a column because there is still be
stored compressed data in that column.

That 'attisdropped' part has been removed.

>
> 9) configure routines that only check if (options != NIL) and then
> error out (like tsvector_configure) seem a bit unnecessary. Just
> allow it to be NULL in CompressionMethodRoutine, and throw an error
> if options is not NIL for such compression method.

Good idea, done.

>
> 10) toast_compress_datum still does this:
>
> if (!ac && (valsize < PGLZ_strategy_default->min_input_size ||
> valsize > PGLZ_strategy_default->max_input_size))
>
> which seems rather pglz-specific (the naming is a hint). Why shouldn't
> this be specific to compression, exposed either as min/max constants,
> or wrapped in another routine - size_is_valid() or something like
> that?

I agree, moved to the next block related with pglz.

>
> 11) The comments in toast_compress_datum probably need updating, as it
> still references to pglz specifically. I guess the new compression
> methods do matter too.

Done.

>
> 12) get_compression_options_info organizes the compression info into a
> hash table by OID. The hash table implementation assumes the hash key
> is at the beginning of the entry, but AttributeCompression is defined
> like this:
>
> typedef struct
> {
> CompressionMethodRoutine *routine;
> List *options;
> Oid cmoptoid;
> } AttributeCompression;
>
> Which means get_compression_options_info is busted, will never lookup
> anything, and the hash table will grow by adding more and more entries
> into the same bucket. Of course, this has extremely negative impact on
> performance (pretty much arbitrarily bad, depending on how many
> entries you've already added to the hash table).
>
> Moving the OID to the beginning of the struct fixes the issue.

Yeah, I fixed it before, but somehow managed to do not include it to the
patch.

>
> 13) When writing the experimental extension, I was extremely confused
> about the regular varlena headers, custom compression headers, etc. In
> the end I stole the code from tsvector.c and whacked it a bit until it
> worked, but I wouldn't dare to claim I understand how it works.
>
> This needs to be documented somewhere. For example postgres.h has a
> bunch of paragraphs about varlena headers, so perhaps it should be
> there? I see the patch tweaks some of the constants, but does not
> update the comment at all.

This point is good, I'm not sure how this documentation should look
like. I've just assumed that people should have deep undestanding of
varlenas if they're going to compress them. But now it's easy to make
mistake there. Maybe I should add some functions that help to construct
varlena, with different headers. I like the way is how jsonb is
constructed. It uses StringInfo and there are few helper functions
(reserveFromBuffer, appendToBuffer and others). Maybe
they should be not static.

>
> Perhaps it would be useful to provide some additional macros making
> access to custom-compressed varlena values easier. Or perhaps the
> VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already support that?
> This part is not very clear to me.

These macros will work, custom compressed varlenas behave like old
compressed varlenas.

> > Still it's a problem if the user used for example `SELECT
> > <compressed_column> INTO * FROM *` because postgres will copy
> > compressed tuples, and there will not be any dependencies between
> > destination and the options.
> >
>
> This seems like a rather fatal design flaw, though. I'd say we need to
> force recompression of the data, in such cases. Otherwise all the
> dependency tracking is rather pointless.

Fixed this problem too. I've added recompression for datum that use
custom compression.

--
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
custom_compression_methods_v5.patch text/x-patch 251.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-11-21 14:59:00 Re: Inlining functions with "expensive" parameters
Previous Message Andreas Karlsson 2017-11-21 14:45:37 Re: [HACKERS] GnuTLS support