Re: Faster compression, again

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Daniel Farina <daniel(at)heroku(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Faster compression, again
Date:	2012-03-15 01:44:56
Message-ID:	10658.1331775896@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Mar 14, 2012 at 6:08 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Doesn't it always start with a header of two int32 values where the
>> first must be smaller than the second? That seems like enough to
>> get traction for an identifiably different header for an alternative
>> compression technique.

> The first of those words is vl_len_, which we can't fiddle with too
> much, but the second is rawsize, which we can definitely fiddle with.
> Right now, rawsize < vl_len_ means it's compressed; and rawsize ==
> vl_len_ means it's uncompressed. As you point out, rawsize > vl_len_
> is undefined; also, and maybe simpler, rawsize < 0 is undefined.

Well, let's please not make the same mistake again of assuming that
there will never again be any other ideas in this space. IOW, let's
find a way to shoehorn in an actual compression-method ID value of some
sort. I don't particularly care for trying to push that into rawsize,
because you don't really have more than about one bit to work with
there, unless you eat the entire word for ID purposes, which seems
excessive.

After looking at pg_lzcompress.c for a bit, it appears to me that the
LSB of the first byte of compressed data must always be zero, because
the very first control bit has to say "copy a literal byte"; you can't
have a back-reference until there's some data in the output buffer.
So what I suggest is that we keep rawsize the same as it is, but peek at
the first byte after that to decide what we have: an even value means the
existing compression method, and an odd value is an ID byte selecting
some new method. This gives us room for 128 new methods before we have
trouble again, while consuming only one byte, which seems like acceptable
overhead for the purpose.
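
To make the even/odd test concrete, here is a minimal sketch in C. All of the names are hypothetical and none of this is taken from pg_lzcompress.c. The mapping of a method number 0..127 onto the odd byte values is just one possible convention, assumed here for illustration; the proposal itself only requires that the ID byte be odd.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of the byte that follows rawsize. */
typedef enum
{
    METHOD_EXISTING,    /* even byte: first control byte of the existing format */
    METHOD_NEW          /* odd byte: ID byte selecting some new method */
} method_kind;

/*
 * The existing format always starts with a control byte whose LSB is
 * zero (the first item must be a literal), so a set LSB cannot be
 * existing-format data.
 */
static method_kind
classify_first_byte(uint8_t first_byte)
{
    return (first_byte & 1) ? METHOD_NEW : METHOD_EXISTING;
}

/*
 * Assumed convention: method number n (0..127) is stored as the odd
 * byte 2n + 1, which yields exactly 128 distinct ID bytes.
 */
static uint8_t
method_number_to_id_byte(unsigned method_number)
{
    assert(method_number < 128);
    return (uint8_t) ((method_number << 1) | 1);
}

/* Inverse of the above; only meaningful for odd bytes. */
static unsigned
id_byte_to_method_number(uint8_t id_byte)
{
    assert(id_byte & 1);
    return id_byte >> 1;
}
```

A decompressor following this sketch would call classify_first_byte() on the byte after rawsize, hand even-byte data to the existing code path unchanged, and for odd bytes skip the ID byte and dispatch on the method number.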

regards, tom lane
