Re: [HACKERS] Custom compression methods

From: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, David Steele <david(at)pgmasters(dot)net>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)gmail(dot)com>, Ildus Kurbangaliev <i(dot)kurbangaliev(at)postgrespro(dot)ru>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [HACKERS] Custom compression methods
Date: 2020-09-01 23:27:22
Message-ID: E112AF88-6148-4540-B094-7D157B67B122@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Aug 13, 2020, at 4:48 AM, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns. However, we don't support changing the compression
> method for the existing column. As part of this patch, there is only
> one built-in compression method that can be set (pglz). In this, we
> have one in-build am (pglz) and the compressed attributes will directly
> store the oid of the AM. In this patch, I have removed the
> pg_attr_compresion as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.

I do not like the way pglz compression is handled in this patch. After upgrading PostgreSQL to the first version with this patch included, pre-existing on-disk compressed data will not include any custom compression Oid in the header, and toast_decompress_datum will notice that and decompress the data directly using pglz_decompress. If the same data were then written back out, perhaps to another table, into a column with no custom compression method defined, it will get compressed by toast_compress_datum using DefaultCompressionOid, which is defined as PGLZ_COMPRESSION_AM_OID. That isn't a proper round-trip for the data, as when it gets re-compressed, the PGLZ_COMPRESSION_AM_OID gets written into the header, which makes the data a bit longer, but also means that it is not byte-for-byte the same as it was, which is counter-intuitive. Given that any given pglz compressed datum now has two totally different formats that might occur on disk, code may have to consider both of them, which increases code complexity, and regression tests will need to be written with coverage for both of them, which increases test complexity. It's also not easy to write the extra tests, as there isn't any way (that I see) to intentionally write out the traditional shorter form from a newer database server; you'd have to do something like a pg_upgrade test where you install an older server to write the older format, upgrade, and then check that the new server can handle it.

The cleanest solution to this would seem to be removal of the compression am's Oid from the header for all compression ams, so that pre-patch written data and post-patch written data look exactly the same. The other solution is to give pglz pride-of-place as the original compression mechanism, and just say that when pglz is the compression method, no Oid gets written to the header, and only when other compression methods are used does the Oid get written. This second option seems closer to the implementation that you already have, because you already handle the decompression of data where the Oid is lacking, so all you have to do is intentionally not write the Oid when compressing using pglz.

Or did I misunderstand your implementation?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2020-09-01 23:36:02 Re: logtape.c stats don't account for unused "prefetched" block numbers
Previous Message Peter Geoghegan 2020-09-01 22:56:44 Re: Problems with the FSM, heap fillfactor, and temporal locality