Re: [HACKERS] Custom compression methods

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2021-02-01 21:15:26
Message-ID: CA+TgmoZ50fvbfhoqXZWAnrFx+_7qOiPE7xNXfat0h7awbm_nxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Some more review comments:

'git am' barfs on v0001 because it's got a whitespace error.

VARFLAGS_4B_C() doesn't seem to be used in any of the patches. I'm OK
with keeping it even if it's not used just because maybe someone will
need it later but, uh, don't we need to use it someplace?

To avoid moving the goalposts for a basic install, I suggest that
--with-lz4 should default to disabled. Maybe we'll want to rethink
that at some point, but since we're just getting started with this
whole thing, I don't think now is the time.

The change to ddl.sgml doesn't seem to make sense to me. There might
be someplace where we want to explain how properties are inherited in
partitioning hierarchies, but I don't think this is the right place,
and I don't think this explanation is particularly clear.

+ This clause adds the compression method to a column. The Compression
+ method can be set from available compression methods. The built-in
+ methods are <literal>pglz</literal> and <literal>lz4</literal>.
+ If no compression method is specified, then compressible types will have
+ the default compression method <literal>pglz</literal>.

Suggest: This sets the compression method for a column. The supported
compression methods are <literal>pglz</literal> and
<literal>lz4</literal>. <literal>lz4</literal> is available only if
<literal>--with-lz4</literal> was used when building
<productname>PostgreSQL</productname>. The default is
<literal>pglz</literal>.

We should make sure, if you haven't already, that trying to create a
column with LZ4 compression fails at table creation time if the build
does not support LZ4. But, someone could also create a table using a
build that has LZ4 support and then switch to a different set of
binaries that do not have it, so we need the runtime checks also.
However, those runtime checks shouldn't fail simplify from trying to
access a table that is set to use LZ4 compression; they should only
fail if we actually need to decompress an LZ4'd value.

Since indexes don't have TOAST tables, it surprises me that
brin_form_tuple() thinks it can TOAST anything. But I guess that's not
this patch's problem, if it's a problem at all.

I like the fact that you changed the message "compressed data is
corrupt" to indicate the compression method, but I think the resulting
message doesn't follow style guidelines because I don't believe we
normally put something with a colon prefix at the beginning of a
primary error message. So instead of saying "pglz: compressed data is
corrupt" I think you should say something like "compressed pglz data
is corrupt". Also, I suggest that we take this opportunity to switch
to ereport() rather than elog() and set
errcode(ERRCODE_DATA_CORRUPTED).

What testing have you done for performance impacts? Does the patch
slow things down noticeably with pglz? (Hopefully not.) Can you
measure a performance improvement with pglz? (Hopefully so.) Is it
likely to hurt performance that there's no minimum size for lz4
compression as we have for pglz? Seems like that could result in a lot
of wasted cycles trying to compress short strings.

pglz_cmcompress() cancels compression if the resulting value would be
larger than the original one, but it looks like lz4_cmcompress() will
just store the enlarged value. That seems bad.

pglz_cmcompress() doesn't need to pfree(tmp) before elog(ERROR).

CompressionOidToId(), CompressionIdToOid() and maybe other places need
to remember the message style guidelines. Primary error messages are
not capitalized.

Why should we now have to include toast_internals.h in
reorderbuffer.c, which has no other changes? That definitely shouldn't
be necessary. If something in another header file now requires
something from toast_internals.h, then that header file would be
obliged to include toast_internals.h itself. But actually that
shouldn't happen, because the whole point of toast_internals.h is that
it should not be included in very many places at all. If we're adding
stuff there that is going to be broadly needed, we're adding it in the
wrong place.

varlena.c shouldn't need toast_internals.h either, and if it did, it
should be in alphabetical order.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michail Nikolaev 2021-02-01 21:19:03 Re: Thoughts on "killed tuples" index hint bits support on standby
Previous Message Daniel Gustafsson 2021-02-01 20:51:49 Re: Support for NSS as a libpq TLS backend