Re: pluggable compression support

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Subject: Re: pluggable compression support
Date: 2013-06-25 16:22:31
Message-ID: CA+TgmoZdWLNS-+61woU68GqFFS47y837tkWAm2_jB5S3b594nQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 20, 2013 at 8:09 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-06-15 12:20:28 +0200, Andres Freund wrote:
>> On 2013-06-14 21:56:52 -0400, Robert Haas wrote:
>> > I don't think we need it. I think what we need to do is decide which
>> > algorithm is legally OK to use. And then put it in.
>> >
>> > In the past, we've had a great deal of speculation about that legal
>> > question from people who are not lawyers. Maybe it would be valuable
>> > to get some opinions from people who ARE lawyers. Tom and Heikki both
>> > work for real big companies which, I'm guessing, have substantial
>> > legal departments; perhaps they could pursue getting the algorithms of
>> > possible interest vetted. Or, I could try to find out whether it's
>> > possible to do something similar through EnterpriseDB.
>>
>> I personally don't think the legal arguments hold all that much water
>> for snappy and lz4. But then the opinion of a European non-lawyer doesn't
>> hold much either.
>> Both are widely used by a large number of open and closed projects, some of
>> which have patent grant clauses in their licenses. E.g. hadoop,
>> cassandra use lz4, and I'd be surprised if the companies behind those
>> have opened themselves to litigation.
>>
>> I think we should preliminarily decide which algorithm to use before we
>> get lawyers involved. I'd be surprised if they can make such an analysis
>> faster than we can rule out one of them via benchmarks.
>>
>> Will post an updated patch that includes lz4 as well.
>
> Attached.
>
> Changes:
> * add lz4 compression algorithm (2 clause bsd)
> * move compression algorithms into own subdirectory
> * clean up compression/decompression functions
> * allow 258 compression algorithms, using 1 extra byte for any but the
> first three
> * don't pass a varlena to pg_lzcompress.c anymore, but data directly
> * add pglz_long as a test fourth compression method that uses the +1
> byte encoding
> * use postgres' endian detection in snappy for compatibility with OS X
>
> Based on the benchmarks I think we should go with lz4 only for now. The
> patch provides the infrastructure should somebody else want to add more
> or even proper configurability.
>
> Todo:
> * windows build support
> * remove toast_compression_algo guc
> * remove either snappy or lz4 support
> * remove pglz_long support (just there for testing)
>
> New benchmarks:
>
> Table size:
>                           List of relations
>  Schema |        Name        | Type  | Owner  |  Size  | Description
> --------+--------------------+-------+--------+--------+-------------
>  public | messages_pglz      | table | andres | 526 MB |
>  public | messages_snappy    | table | andres | 523 MB |
>  public | messages_lz4       | table | andres | 522 MB |
>  public | messages_pglz_long | table | andres | 527 MB |
> (4 rows)
>
> Workstation (2xE5520, enough s_b for everything):
>
> Data load:
> pglz: 36643.384 ms
> snappy: 24626.894 ms
> lz4: 23871.421 ms
> pglz_long: 37097.681 ms
>
> COPY messages_* TO '/dev/null' WITH BINARY;
> pglz: 3116.083 ms
> snappy: 2524.388 ms
> lz4: 2349.396 ms
> pglz_long: 3104.134 ms
>
> COPY (SELECT rawtxt FROM messages_*) TO '/dev/null' WITH BINARY;
> pglz: 1609.969 ms
> snappy: 1031.696 ms
> lz4: 886.782 ms
> pglz_long: 1606.803 ms
>
>
> On my elderly laptop (Core 2 Duo), too low shared buffers:
>
> Data load:
> pglz: 39968.381 ms
> snappy: 26952.330 ms
> lz4: 29225.472 ms
> pglz_long: 39929.568 ms
>
> COPY messages_* TO '/dev/null' WITH BINARY;
> pglz: 3920.588 ms
> snappy: 3421.938 ms
> lz4: 3311.540 ms
> pglz_long: 3885.920 ms
>
> COPY (SELECT rawtxt FROM messages_*) TO '/dev/null' WITH BINARY;
> pglz: 2238.145 ms
> snappy: 1753.403 ms
> lz4: 1638.092 ms
> pglz_long: 2227.804 ms

Well, the performance of both snappy and lz4 seems to be significantly
better than pglz. On these tests lz4 has a small edge but that might
not be true on other data sets. I still think the main issue is legal
review: are there any license or patent concerns about including
either of these algorithms in PG? If neither of them has issues, we
might need to experiment a little more before picking between them.
If one does and the other does not, well, then it's a short
conversation.
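
To put rough numbers on "significantly better", the quoted workstation
timings can be turned into speedup ratios relative to pglz. This is only a
minimal sketch: the timings are copied from the benchmark quoted above, while
the dictionary layout and the function name speedup_vs_pglz are illustrative
and not part of the patch.

```python
# Timings in milliseconds, copied from the quoted workstation (2xE5520) run.
# The labels and structure are illustrative, not taken from the patch.
workstation_ms = {
    "data load": {"pglz": 36643.384, "snappy": 24626.894, "lz4": 23871.421},
    "COPY table": {"pglz": 3116.083, "snappy": 2524.388, "lz4": 2349.396},
    "COPY rawtxt": {"pglz": 1609.969, "snappy": 1031.696, "lz4": 886.782},
}


def speedup_vs_pglz(timings):
    """Return pglz_time / algo_time for every algorithm other than pglz."""
    base = timings["pglz"]
    return {algo: base / ms for algo, ms in timings.items() if algo != "pglz"}


for test, timings in workstation_ms.items():
    ratios = speedup_vs_pglz(timings)
    summary = ", ".join(f"{algo}: {ratio:.2f}x" for algo, ratio in ratios.items())
    print(f"{test}: {summary}")
```

On these figures lz4 is roughly 1.5x faster than pglz for the data load and
roughly 1.8x faster for the rawtxt-only COPY. The quoted laptop run has snappy
ahead of lz4 on the data load, so the ordering between the two is not
consistent even within this one data set.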

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
