Re: Re: [HACKERS] Custom compression methods

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Re: [HACKERS] Custom compression methods
Date: 2020-09-19 07:49:48
Message-ID: CAFiTN-vcbfy5ScKVUp16c1N_wzP0RL6EkPBAg_Jm3eDK0ftO5Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 25, 2020 at 11:20 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Mon, Aug 24, 2020 at 2:12 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > IIUC, the main reason for using this flag is for taking the decision
> > whether we need any detoasting for this tuple. For example, if we are
> > rewriting the table because the compression method is changed then if
> > HEAP_HASCUSTOMCOMPRESSED bit is not set in the tuple header and tuple
> > length, not tup->t_len > TOAST_TUPLE_THRESHOLD then we don't need to
> > call heap_toast_insert_or_update function for this tuple. Whereas if
> > this flag is set then we need to because we might need to uncompress
> > and compress back using a different compression method. The same is
> > the case with INSERT into SELECT * FROM.
>
> This doesn't really seem worth it to me. I don't see how we can
> justify burning an on-disk bit just to save a little bit of overhead
> during a rare maintenance operation. If there's a performance problem
> here we need to look for another way of mitigating it. Slowing CLUSTER
> and/or VACUUM FULL down by a large amount for this feature would be
> unacceptable, but is that really a problem? And if so, can we solve it
> without requiring this bit?

Okay, if we want to avoid keeping the bit then there are multiple ways
to handle this, but the only thing is none of that will be specific to
those scenarios.
approach1. In ExecModifyTable, we can process the source tuple and see
if any of the varlena attributes is compressed and its stored
compression method is not the same as the target table attribute then
we can decompress it.
approach2. In heap_prepare_insert, always call the
heap_toast_insert_or_update, therein we can check if any of the source
tuple attributes are compressed with different compression methods
then the target table then we can decompress it.

With either of the approach, we have to do this in a generic path
because the source of the tuple is not known, I mean it can be a
output from a function, or the join tuple or a subquery. So in the
attached patch, I have implemented with approach1.

For testing, I have implemented using approach1 as well as using
approach2 and I have checked the performance of the pg_bench to see
whether it impacts the performance of the generic paths or not, but I
did not see any impact.

>
> > I have already extracted these 2 patches from the main patch set.
> > But, in these patches, I am still storing the am_oid in the toast
> > header. I am not sure can we get rid of that at least for these 2
> > patches? But, then wherever we try to uncompress the tuple we need to
> > know the tuple descriptor to get the am_oid but I think that is not
> > possible in all the cases. Am I missing something here?
>
> I think we should instead use the high bits of the toast size word for
> patches #1-#4, as discussed upthread.
>
> > > > Patch #3. Add support for changing the compression method associated
> > > > with a column, forcing a table rewrite.
> > > >
> > > > Patch #4. Add support for PRESERVE, so that you can change the
> > > > compression method associated with a column without forcing a table
> > > > rewrite, by including the old method in the PRESERVE list, or with a
> > > > rewrite, by not including it in the PRESERVE list.
> >
> > Does this make sense to have Patch #3 and Patch #4, without having
> > Patch #5? I mean why do we need to support rewrite or preserve unless
> > we have the customer compression methods right? because the build-in
> > compression method can not be dropped so why do we need to preserve?
>
> I think that patch #3 makes sense because somebody might have a table
> that is currently compressed with pglz and they want to switch to lz4,
> and I think patch #4 also makes sense because they might want to start
> using lz4 for future data but not force a rewrite to get rid of all
> the pglz data they've already got. Those options are valuable as soon
> as there is more than one possible compression algorithm, even if
> they're all built in. Now, as I said upthread, it's also true that you
> could do #5 before #3 and #4. I don't think that's insane. But I
> prefer it in the other order, because I think having #5 without #3 and
> #4 wouldn't be too much fun for users.

Details of the attached patch set

0001: This provides syntax to set the compression method from the
built-in compression method (pglz or zlib). pg_attribute stores the
compression method (char) and there are conversion functions to
convert that compression method to the built-in compression array
index. As discussed up thread the first 2 bits will be storing the
compression method index using that we can directly get the handler
routing using the built-in compression method array.

0002: This patch provides an option to changes the compression method
for an existing column and it will rewrite the table.

Next, I will be working on providing an option to alter the
compression method without rewriting the whole table, basically, we
can provide a preserve list to preserve old compression methods.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
v2-0002-alter-table-set-compression.patch application/octet-stream 9.6 KB
v2-0001-Built-in-compression-method.patch application/octet-stream 175.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2020-09-19 08:18:40 Re: Resetting spilled txn statistics in pg_stat_replication
Previous Message legrand legrand 2020-09-19 07:35:42 Re: [PATCH] Add features to pg_stat_statements