Re: Compression and on-disk sorting

From: Andrew Piskorski <atp(at)piskorski(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 00:14:34
Message-ID: 20060517001434.GA3222@tehun.pair.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
> > Rod Taylor wrote:
> > >>I habitually turn off all compression on my Windows boxes, because it's
> > >>a performance hit in my experience. Disk is cheap ...
> > >
> > >Disk storage is cheap. Disk bandwidth or throughput is very expensive.

> > Sure, but in my experience using Windows File System compression is not
> > a win here. Presumably if it were an unqualified win they would have it

> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C abilitiy...

Folks, first of all, I'm in no way an expert on data compression in
RDBMSs, but other databases DO include data compression features and
claim it as a SIGNIFICANT win in I/O reduction.

Looking at performance of the Windows File System compression, etc.,
doesn't make too much sense when there are actual RDBMS compression
implementations to compare to, on the commerical market, in open
source code, and in the academic literature.

Oracle has included "table compression" since 9iR2. They report table
size reductions of 2x to 4x as typical, with proportional reductions
in I/O, and supposedly, usually low to negligible overhead for writes:

http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14211/build_db.htm#sthref289

Decision Speed: Table Compression In Action by Meikel Poess and Hermann Baer (2003):
http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/poess_tablecomp.html

Compressing Data for Space and Speed by Sanjay Mishra (2004):
http://www.oracle.com/technology/oramag/oracle/04-mar/o24tech_data.html

Order For Maximum Compression:
http://oramossoracle.blogspot.com/2005/11/table-compression-order-for-maximum.html

I don't remember whether the current (Open Source) MonetDB includes
table compression or not, but they've published papers with lots of
interesting detail on the compression and other high performance OLAP
features in their latest (not released) "X100" MoneyDB research
codebase:

http://monetdb.cwi.nl/
http://homepages.cwi.nl/~mk/MonetDB/
http://sourceforge.net/projects/monetdb/
ftp://ftp.research.microsoft.com/pub/debull/A05june/issue1.htm

Now, the docs and papers above are all focused on query performance,
they say nothing directly about using using compression for on-disk
sorts. But, I'd bet that similar rules of thumb will apply in both
cases.

The main tricks seem to be: One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as cpu friendly as
posible. Two, keep the data compressed in RAM as well so that you can
also cache more of the data, and indeed keep it the compressed until
as late in the CPU processing pipeline as possible.

A corrolary of that is forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the cpu to be
particularly useful in improving overall throughput rates.

Note, I have not really tested ANY of the above myself, your mileage
may well vary from what I recall from those various articles...

--
Andrew Piskorski <atp(at)piskorski(dot)com>
http://www.piskorski.com/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jaime Casanova 2006-05-17 00:51:10 Re: PL/pgSQL 'i = i + 1' Syntax
Previous Message David Wheeler 2006-05-17 00:07:25 Re: PL/pgSQL 'i = i + 1' Syntax

Browse pgsql-patches by date

  From Date Subject
Next Message Bruce Momjian 2006-05-17 02:18:30 Re: [HACKERS] .pgpass file and unix domain sockets
Previous Message Tom Lane 2006-05-16 22:48:33 Re: Compression and on-disk sorting