Re: [HACKERS] Custom compression methods

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2021-02-03 04:55:39
Message-ID: CAFiTN-sR-14cKLn20L+vA+zwNxSe=jJVxy3N4xvbvERTKXp7BQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 3, 2021 at 2:07 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> Even more review comments, still looking mostly at 0001:
>
> If there's a reason why parallel_schedule is arranging to run the
> compression test in parallel with nothing else, the comment in that
> file should explain the reason. If there isn't, it should be added to
> a parallel group that doesn't have the maximum number of tests yet,
> probably the last such group in the file.
>
> serial_schedule should add the test in a position that roughly
> corresponds to where it appears in parallel_schedule.
>
> I believe it's relatively standard practice to put variable
> declarations at the top of the file. compress_lz4.c and
> compress_pglz.c instead put those declarations nearer to the point of
> use.
>
> compressamapi.c has an awful lot of #include directives for the code
> it actually contains. I believe that we should cut that down to what
> is required by 0001, and other patches can add more later as required.
> In fact, it's tempting to just get rid of this .c file altogether and
> make the two functions it contains static inline functions in the
> header, but I'm not 100% sure that's a good idea.
>
> The copyright dates in a number of the file headers are out of date.
>
> binary_upgrade_next_pg_am_oid and the related changes to
> CreateAccessMethod don't belong in 0001, because it doesn't support
> non-built-in compression methods. These changes and the related
> pg_dump change should be moved to the patch that adds support for
> that.
>
> The comments added to dumpTableSchema() say that "compression is
> assigned by ALTER" but don't give a reason. I think they should. I
> don't know how much they need to explain about what the code does, but
> they definitely need to explain why it does it. Also, isn't this bad?
> If we create the column with the wrong compression setting initially
> and then ALTER it, we have to rewrite the table. If it's empty, that's
> cheap, but it'd still be better not to do it at all.
>
> I'm not sure it's a good idea for dumpTableSchema() to leave out
> specifying the compression method if it happens to be pglz. I think we
> definitely shouldn't do it in binary-upgrade mode. What if we changed
> the default in a future release? For that matter, even 0002 could make
> the current approach unsafe.... I think, anyway.
>
> The changes to pg_dump.h look like they haven't had a visit from
> pgindent. You should probably try to do that for the whole patch,
> though it's a bit annoying since you'll have to manually remove
> unrelated changes to the same files that are being modified by the
> patch. Also, why the extra blank line here?
>
> GetAttributeCompression() is hard to understand. I suggest changing
> the comment to "resolve column compression specification to an OID"
> and somehow rejigger the code so that you aren't using one not-NULL
> test and one NULL test on the same variable. Like maybe change the
> first part to if (!IsStorageCompressible(typstorage)) { if
> (compression == NULL) return InvalidOid; ereport(ERROR, ...); }
>
> It puzzles me that CompareCompressionMethodAndDecompress() calls
> slot_getallattrs() just before clearing the slot. It seems like this
> ought to happen before we loop over the attributes, so that we don't
> need to call slot_getattr() every time. See the comment for that
> function. But even if we didn't do that for some reason, why would we
> do it here? If it's already been done, it shouldn't do anything, and
> if it hasn't been done, it might overwrite some of the values we just
> poked into tts_values. It also seems suspicious that we can get away
> with clearing the slot and then again marking it valid. I'm not sure
> it really works like that. Like, can't clearing the slot invalidate
> pointers stored in tts_values[]? For instance, if they are pointers
> into an in-memory heap tuple, tts_heap_clear() is going to free the
> tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is
> going to unpin it. I think the supported procedure for this sort of
> thing is to have a second slot, set tts_values, tts_isnull etc. and
> then materialize the slot. After materializing the new slot, it's
> independent of the old slot, which can then be cleared. See for
> example tts_virtual_materialize(). The whole approach you've taken
> here might need to be rethought a bit. I think you are right to want
> to avoid copying everything over into a new slot if nothing needs to
> be done, and I think we should definitely keep that optimization, but
> I think if you need to copy stuff, you have to do the above procedure
> and then continue using the other slot instead of the original one.
> Some places I think we have functions that return either the original
> slot or a different one depending on how it goes; that might be a
> useful idea here. But, you also can't just spam-create slots; it's
> important that whatever ones we end up with get reused for every
> tuple.
>
> Doesn't the change to describeOneTableDetails() require declaring
> changing the declaration of char *headers[11] to char *headers[12]?
> How does this not fail Assert(cols <= lengthof(headers))?
>
> Why does describeOneTableDetais() arrange to truncate the printed
> value? We don't seem to do that for the other column properties, and
> it's not like this one is particularly long.
>
> Perhaps the changes to pg_am.dat shouldn't remove the blank line?
>
> I think the comment to pg_attribute.h could be rephrased to stay
> something like: "OID of compression AM. Must be InvalidOid if and only
> if typstorage is 'a' or 'b'," replacing 'a' and 'b' with whatever the
> right letters are. This would be shorter and I think also clearer than
> what you have
>
> The first comment change in postgres.h is wrong. You changed
> va_extsize to "size in va_extinfo" but the associated structure
> definition is unchanged, so the comment shouldn't be changed either.
>
> In toast_internals.h, you end using 30 as a constant several times but
> have no #define for it. You do have a #define for RAWSIZEMASK, but
> that's really a derived value from 30. Also, it's not a great name
> because it's kind of generic. So how about something like:
>
> #define TOAST_RAWSIZE_BITS 30
> #define TOAST_RAWSIZE_MASK ((1 << (TOAST_RAW_SIZE_BITS + 1)) - 1)
>
> But then again on second thought, this 30 seems to be the same 30 that
> shows up in the changes to postgres.h, and there again 0x3FFFFFFF
> shows up too. So maybe we should actually be defining these constants
> there, using names like VARLENA_RAWSIZE_BITS and VARLENA_RAWSIZE_MASK
> and then having toast_internals.h use those constants as well.
>
> Taken with the email I sent yesterday, I think this is a more or less
> complete review of 0001. Although there are a bunch of things to fix
> here still, I don't think this is that far from being committable. I
> don't at this point see too much in terms of big design problems.
> Probably the CompareCompressionMethodAndDecompress() is the closest to
> a design-level problem, and certainly something needs to be done about
> it, but even that is a fairly localized problem in the context of the
> entire patch.

Thanks, Robert for the detailed review. I will work on these comments
and post the updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-02-03 04:56:59 Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Previous Message japin 2021-02-03 04:30:56 Re: memory leak in auto_explain