Re: [HACKERS] Custom compression methods

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2021-02-02 20:37:19
Message-ID: CA+TgmoZzQWBEkoZEon=TXwJhwZGXPUW7ma4xxh9DaJoHB8CfWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Even more review comments, still looking mostly at 0001:

If there's a reason why parallel_schedule is arranging to run the
compression test in parallel with nothing else, the comment in that
file should explain the reason. If there isn't, it should be added to
a parallel group that doesn't have the maximum number of tests yet,
probably the last such group in the file.

serial_schedule should add the test in a position that roughly
corresponds to where it appears in parallel_schedule.

I believe it's relatively standard practice to put variable
declarations at the top of the file. compress_lz4.c and
compress_pglz.c instead put those declarations nearer to the point of
use.

compressamapi.c has an awful lot of #include directives for the code
it actually contains. I believe that we should cut that down to what
is required by 0001, and other patches can add more later as required.
In fact, it's tempting to just get rid of this .c file altogether and
make the two functions it contains static inline functions in the
header, but I'm not 100% sure that's a good idea.

The copyright dates in a number of the file headers are out of date.

binary_upgrade_next_pg_am_oid and the related changes to
CreateAccessMethod don't belong in 0001, because it doesn't support
non-built-in compression methods. These changes and the related
pg_dump change should be moved to the patch that adds support for
that.

The comments added to dumpTableSchema() say that "compression is
assigned by ALTER" but don't give a reason. I think they should. I
don't know how much they need to explain about what the code does, but
they definitely need to explain why it does it. Also, isn't this bad?
If we create the column with the wrong compression setting initially
and then ALTER it, we have to rewrite the table. If it's empty, that's
cheap, but it'd still be better not to do it at all.

I'm not sure it's a good idea for dumpTableSchema() to leave out
specifying the compression method if it happens to be pglz. I think we
definitely shouldn't do it in binary-upgrade mode. What if we changed
the default in a future release? For that matter, even 0002 could make
the current approach unsafe.... I think, anyway.

The changes to pg_dump.h look like they haven't had a visit from
pgindent. You should probably try to do that for the whole patch,
though it's a bit annoying since you'll have to manually remove
unrelated changes to the same files that are being modified by the
patch. Also, why the extra blank line here?

GetAttributeCompression() is hard to understand. I suggest changing
the comment to "resolve column compression specification to an OID"
and somehow rejigger the code so that you aren't using one not-NULL
test and one NULL test on the same variable. Like maybe change the
first part to if (!IsStorageCompressible(typstorage)) { if
(compression == NULL) return InvalidOid; ereport(ERROR, ...); }

It puzzles me that CompareCompressionMethodAndDecompress() calls
slot_getallattrs() just before clearing the slot. It seems like this
ought to happen before we loop over the attributes, so that we don't
need to call slot_getattr() every time. See the comment for that
function. But even if we didn't do that for some reason, why would we
do it here? If it's already been done, it shouldn't do anything, and
if it hasn't been done, it might overwrite some of the values we just
poked into tts_values. It also seems suspicious that we can get away
with clearing the slot and then again marking it valid. I'm not sure
it really works like that. Like, can't clearing the slot invalidate
pointers stored in tts_values[]? For instance, if they are pointers
into an in-memory heap tuple, tts_heap_clear() is going to free the
tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is
going to unpin it. I think the supported procedure for this sort of
thing is to have a second slot, set tts_values, tts_isnull etc. and
then materialize the slot. After materializing the new slot, it's
independent of the old slot, which can then be cleared. See for
example tts_virtual_materialize(). The whole approach you've taken
here might need to be rethought a bit. I think you are right to want
to avoid copying everything over into a new slot if nothing needs to
be done, and I think we should definitely keep that optimization, but
I think if you need to copy stuff, you have to do the above procedure
and then continue using the other slot instead of the original one.
Some places I think we have functions that return either the original
slot or a different one depending on how it goes; that might be a
useful idea here. But, you also can't just spam-create slots; it's
important that whatever ones we end up with get reused for every
tuple.

Doesn't the change to describeOneTableDetails() require declaring
changing the declaration of char *headers[11] to char *headers[12]?
How does this not fail Assert(cols <= lengthof(headers))?

Why does describeOneTableDetais() arrange to truncate the printed
value? We don't seem to do that for the other column properties, and
it's not like this one is particularly long.

Perhaps the changes to pg_am.dat shouldn't remove the blank line?

I think the comment to pg_attribute.h could be rephrased to stay
something like: "OID of compression AM. Must be InvalidOid if and only
if typstorage is 'a' or 'b'," replacing 'a' and 'b' with whatever the
right letters are. This would be shorter and I think also clearer than
what you have

The first comment change in postgres.h is wrong. You changed
va_extsize to "size in va_extinfo" but the associated structure
definition is unchanged, so the comment shouldn't be changed either.

In toast_internals.h, you end using 30 as a constant several times but
have no #define for it. You do have a #define for RAWSIZEMASK, but
that's really a derived value from 30. Also, it's not a great name
because it's kind of generic. So how about something like:

#define TOAST_RAWSIZE_BITS 30
#define TOAST_RAWSIZE_MASK ((1 << (TOAST_RAW_SIZE_BITS + 1)) - 1)

But then again on second thought, this 30 seems to be the same 30 that
shows up in the changes to postgres.h, and there again 0x3FFFFFFF
shows up too. So maybe we should actually be defining these constants
there, using names like VARLENA_RAWSIZE_BITS and VARLENA_RAWSIZE_MASK
and then having toast_internals.h use those constants as well.

Taken with the email I sent yesterday, I think this is a more or less
complete review of 0001. Although there are a bunch of things to fix
here still, I don't think this is that far from being committable. I
don't at this point see too much in terms of big design problems.
Probably the CompareCompressionMethodAndDecompress() is the closest to
a design-level problem, and certainly something needs to be done about
it, but even that is a fairly localized problem in the context of the
entire patch.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2021-02-02 21:17:49 Re: [PATCH] Doc: improve documentation of oid columns that can be zero. (correct version)
Previous Message Jacob Champion 2021-02-02 20:33:35 Re: Support for NSS as a libpq TLS backend