Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...

From: JanWieck(at)t-online(dot)de (Jan Wieck)
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL HACKERS <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [SQL] Re: [GENERAL] lztext and compression ratios...
Date: 2000-07-10 15:11:51
Message-ID: 200007101511.RAA11221@hot.jw.home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers pgsql-sql

Tom Lane wrote:
> JanWieck(at)t-online(dot)de (Jan Wieck) writes:
> > eisentrp(at)csis(dot)gvsu(dot)edu wrote:
> >> Maybe you just want to use zlib. Let other guys hammer out the details.
>
> > We cannot assume that zlib is available everywhere.
>
> We can if we include it in our distribution --- which we could; it's
> pretty small and uses a BSD-style license. I can assure you the zlib
> guys would be happy with that. And it's certainly as portable as our
> own code. The real question is, is a custom compressor enough better
> than zlib for our purposes to make it worth taking any patent risks?

Good, we shouldn't worry about that anymore. If we want to
use zlib, I vote for including it into our distribution and
link static against the one shipped with our code.

If we want to ...

> We could run zlib at a low compression setting (-z1 to -z3 maybe)
> to make compression relatively fast, and since that also doesn't
> generate a custom Huffman tree, the overhead in the compressed data
> is minor even for short strings. And its memory footprint is
> certainly no worse than Jan's method...

Definitely not, it's memory footprint is actually much
smaller. Thus, I need to recreate the comparision below
again after making the history table of fixed size with a
wrap around mechanism to get a small footprint on multi-MB
inputs too.

> The real question is whether zlib decompression is markedly slower
> than Jan's code. Certainly Jan's method is a lot simpler and *should*
> be faster --- but on the other hand, zlib has had a heck of a lot
> of careful performance tuning put into it over the years. The speed
> difference might not be as bad as all that.
>
> I think it's worth taking a look at the option.

Some quick numbers though:

I simply stripped down pg_lzcompress.c to call compress2()
and uncompress() instead of doing anything itself (what a
nice, small source file :-). There might be some room for
improvement using static zlib stream allocaions and
deflateReset(), inflateReset() or the like. But I don't
expect a significant difference from that.

The test is a Tcl (pgtclsh) script doing the following:

- Loading 151 HTML files into a table t1 of structure (path
text, content lztext).

- SELECT * FROM t1 and checking for correct result set.
Each file is read again during the check.

- UPDATE t1 SET content = upper(content).

­ SELECT * FROM t1 and checking for correct result set.
Each file is read again, converted to upper case using
Tcl's "string toupper" function for comparision.

- SELECT path FROM t1. Loop over result set to UPDATE t1
SET content = <value> WHERE path = <path>. All files are
read again and converted to lower case before UPDATE.

- SELECT * FROM t1 and check for correct result set. Files
are again reread and lower case converted in Tcl for
comparision.

- Doing 20 SELECT * FROM t1 to have alot more decompress
than compress cycles.

Of course, there's an index on path. Here are the timings and
sizes:

Compressor | level | heap size | toastrel | toastidx | seconds
| | | size | size |
-----------+-------+-----------+----------+----------+--------
PGLZ | - | 425,984 | 950,272 | 32,768 | 5.20
zlib | 1 | 499,712 | 614,400 | 16,384 | 6.85
zlib | 3 | 499,712 | 557,056 | 16,384 | 6.75
zlib | 6 | 491,520 | 524,288 | 16,384 | 7.10
zlib | 9 | 491,520 | 524,288 | 16,384 | 7.21

Seconds is an average over multiple runs. Interesting is that
compression level 3 seems to be faster than 1. I double
checked it because it was so surprising.

Also, increasing the number of SELECT * at the end increases
the difference. So the PGLZ decompressor does a perfect job.

And what must be taken into account too is that the script,
running on the same processor and doing all the overhead
(reading files, doing case conversions, quoting values with
regsub and comparisions), along with the normal Postgres
query execution (parsing, planning, optimizing, execution)
occupies a substantial portion of the bare runtime. Still
PGLZ is about 25% faster than the best zlib compression level
I'm seeing, while zlib gains a much better compression ratio
(factor 1.7 at least).

As I see it:

If replacing the compressor/decompressor can cause a runtime
difference of 25% in such a scenario, the pure difference
between the two methods must be alot.

PGLZ is what I mentioned in the comments. Optimized for speed
on the cost of compression ratio.

What I suggest:

Leave PGLZ in place as the default compressor for toastable
types. Speed is what all benchmarks talk about - on disk
storage size is seldom a minor note.

Fix it's history allocation for huge values and have someone
(PgSQL Inc.?) patenting the compression algorithm, so we're
safe at some point in the future. If there's a patent problem
in it, we are already running the risk to get sued, the PGLZ
code got shipped with 7.0, used in lztext.

We can discuss about enabling zlib as a per attribute
configurable alternative further. But is the confusion this
might cause worth it all?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Thomas Lockhart 2000-07-10 15:39:54 Re: Re: [GENERAL] PostgreSQL vs. MySQL
Previous Message Tom Lane 2000-07-10 15:05:32 Re: [NOVICE] newbie problem on creating table

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2000-07-10 15:14:55 Re: Re: postgres TODO
Previous Message eisentrp 2000-07-10 15:06:31 Re: Templates

Browse pgsql-sql by date

  From Date Subject
Next Message Dorin Grunberg 2000-07-10 17:21:10 Subselects with IN and EXISTS
Previous Message Thomas Lockhart 2000-07-10 14:24:32 Re: Type conversion