Re: [HACKERS] Custom compression methods

From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2020-09-02 10:33:06
Message-ID: CAFiTN-uUUwJnNq1gi5UDj-LAQPOQV0PsJ_=30QLn6HFUK98RZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 2, 2020 at 4:57 AM Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com> wrote:
>
>
>
> > On Aug 13, 2020, at 4:48 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > v1-0001: As suggested by Robert, it provides the syntax support for
> > setting the compression method for a column while creating a table and
> > adding columns. However, we don't support changing the compression
> > method for the existing column. As part of this patch, there is only
> > one built-in compression method that can be set (pglz). In this, we
> > have one in-build am (pglz) and the compressed attributes will directly
> > store the oid of the AM. In this patch, I have removed the
> > pg_attr_compresion as we don't support changing the compression
> > for the existing column so we don't need to preserve the old
> > compressions.
>
> I do not like the way pglz compression is handled in this patch. After upgrading PostgreSQL to the first version with this patch included, pre-existing on-disk compressed data will not include any custom compression Oid in the header, and toast_decompress_datum will notice that and decompress the data directly using pglz_decompress. If the same data were then written back out, perhaps to another table, into a column with no custom compression method defined, it will get compressed by toast_compress_datum using DefaultCompressionOid, which is defined as PGLZ_COMPRESSION_AM_OID. That isn't a proper round-trip for the data, as when it gets re-compressed, the PGLZ_COMPRESSION_AM_OID gets written into the header, which makes the data a bit longer, but also means that it is not byte-for-byte the same as it was, which is counter-intuitive. Given that any given pglz compressed datum now has two totally different formats that might occur on disk, code may have to consider both of them, which increases code complexity, and regression tests will need to be written with coverage for both of them, which increases test complexity. It's also not easy to write the extra tests, as there isn't any way (that I see) to intentionally write out the traditional shorter form from a newer database server; you'd have to do something like a pg_upgrade test where you install an older server to write the older format, upgrade, and then check that the new server can handle it.
>
> The cleanest solution to this would seem to be removal of the compression am's Oid from the header for all compression ams, so that pre-patch written data and post-patch written data look exactly the same. The other solution is to give pglz pride-of-place as the original compression mechanism, and just say that when pglz is the compression method, no Oid gets written to the header, and only when other compression methods are used does the Oid get written. This second option seems closer to the implementation that you already have, because you already handle the decompression of data where the Oid is lacking, so all you have to do is intentionally not write the Oid when compressing using pglz.
>
> Or did I misunderstand your implementation?

Thanks for looking into it. Actually, I am planning to change this
patch such that we will use the upper 2 bits of the size field instead
of storing the amoid for the builtin compression methods.
e. g. 0x00 = pglz, 0x01 = zlib, 0x10 = other built-in, 0x11 -> custom
compression method. And when 0x11 is set then only we will store the
amoid in the toast header. I think after a week or two I will make
these changes and post my updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2020-09-02 10:39:14 Re: Re: [HACKERS] Custom compression methods
Previous Message Dilip Kumar 2020-09-02 10:24:26 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions