Re: Faster compression, again

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Daniel Farina <daniel(at)heroku(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Faster compression, again
Date:	2012-03-15 02:12:24
Message-ID:	CA+Tgmob5P+hY-tNLB5jfuehASMCa38NswRo6tMc0B=Mk0k9TtA@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Mar 14, 2012 at 9:44 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, let's please not make the same mistake again of assuming that
> there will never again be any other ideas in this space. IOW, let's
> find a way to shoehorn in an actual compression-method ID value of some
> sort. I don't particularly care for trying to push that into rawsize,
> because you don't really have more than about one bit to work with
> there, unless you eat the entire word for ID purposes which seems
> excessive.

Well, you don't have to go that far. For example, you could dictate
that, when the value is negative, the most significant byte indicates
the compression algorithm in use (the sign bit is already spoken for,
so the other seven bits allow 128 possible compression algorithms).
The remaining 3 bytes indicate the compressed length; but when they're
all zero, the compressed length is instead stored in the following
4-byte word. This consumes one additional 4-byte word for values that
take >= 16MB compressed, but that's presumably a non-problem.
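
To make the layout above concrete, here is a rough sketch in C. It is
only an illustration of the bit packing, not proposed code: the names
encode_header, decode_header, HDR_FLAG and HDR_LEN_MASK are made up for
this example, and it assumes the header is handled as host-order 32-bit
words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define HDR_FLAG     UINT32_C(0x80000000)   /* sign bit set: value is "negative" */
#define HDR_LEN_MASK UINT32_C(0x00FFFFFF)   /* low 3 bytes: inline compressed length */

/*
 * Fill out[] with the header for the given method (0..127) and compressed
 * length.  Returns the number of 4-byte words used: 1 if the length fits in
 * 3 bytes, 2 if it had to spill into the following word.
 */
static size_t
encode_header(uint32_t out[2], unsigned method, uint32_t complen)
{
    uint32_t word;

    assert(method < 128);
    word = HDR_FLAG | ((uint32_t) method << 24);

    if (complen != 0 && complen <= HDR_LEN_MASK)
    {
        out[0] = word | complen;
        return 1;
    }

    /* All-zero length bits mean "length is in the next word". */
    out[0] = word;
    out[1] = complen;
    return 2;
}

/* Inverse of encode_header; returns the number of words consumed. */
static size_t
decode_header(const uint32_t in[2], unsigned *method, uint32_t *complen)
{
    uint32_t len;

    assert((in[0] & HDR_FLAG) != 0);
    *method = (in[0] >> 24) & 0x7F;
    len = in[0] & HDR_LEN_MASK;

    if (len != 0)
    {
        *complen = len;
        return 1;
    }
    *complen = in[1];
    return 2;
}
```

In this sketch a compressed length of exactly zero would also take the
two-word form, since all-zero length bits are reserved as the escape.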

> After looking at pg_lzcompress.c for a bit, it appears to me that the
> LSB of the first byte of compressed data must always be zero, because
> the very first control bit has to say "copy a literal byte"; you can't
> have a back-reference until there's some data in the output buffer.
> So what I suggest is that we keep rawsize the same as it is, but peek at
> the first byte after that to decide what we have: even means existing
> compression method, an odd value is an ID byte selecting some new
> method. This gives us room for 128 new methods before we have trouble
> again, while consuming only one byte which seems like acceptable
> overhead for the purpose.

That would work, too.
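
For comparison, the even/odd test on the first byte after rawsize could
look roughly like the following. Again this is just a sketch: the names
peek_method, make_id_byte and METHOD_PGLZ are invented here, and mapping
the odd byte to an ID by shifting right one bit is my assumption about
how the 128 values would be numbered, not something anyone has settled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker for "existing compression method". */
#define METHOD_PGLZ (-1)

/*
 * Look at the first byte following rawsize.  An even byte is the start of
 * an existing-format compressed stream; an odd byte is an ID byte for a
 * new method, numbered 0..127 here by dropping the low bit.
 */
static int
peek_method(const uint8_t *data)
{
    if ((data[0] & 1) == 0)
        return METHOD_PGLZ;
    return data[0] >> 1;
}

/* Build the ID byte a new method would write ahead of its own data. */
static uint8_t
make_id_byte(unsigned method)
{
    assert(method < 128);
    return (uint8_t) ((method << 1) | 1);
}
```

Note that in the even case the byte is not consumed: it is still the
first byte of the old-format data, so the decompressor has to start
from it rather than skip it.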

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Faster compression, again at 2012-03-15 01:44:56 from Tom Lane
