Re: [PATCH] Compression and on-disk sorting

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-17 17:38:47
Message-ID: 1147887527.2646.323.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-patches

On Wed, 2006-05-17 at 18:17 +0200, Martijn van Oosterhout wrote:
> Pursuant to the discussions currently on -hackers, here's a patch that
> uses zlib to compress the tapes as they go to disk. I default to
> compression level 3 (think gzip -3).
>
> Please speed test all you like, I *think* it's bug free, but you never
> know.
>
> Outstanding questions:
>
> - I use zlib because the built-in pg_lzcompress can't do what zlib does.
> Here we set up input and output buffers and zlib will process as much as
> it can (until the input is empty or the output is full). This means no
> marshalling is required. We can compress the whole file without having
> it in memory.

The zlib licence is BSD-compatible and it works the way we need it to work.
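
A minimal sketch of the streaming pattern described in the quoted text,
written in Python against the standard zlib module rather than the C API
the patch uses. The names compress_stream, decompress_stream and CHUNK are
hypothetical and the 8 kB buffer size is an assumption; only level 3 comes
from the patch description. It shows that the input is consumed a buffer at
a time, so the whole tape never has to be in memory.

```python
import io
import zlib

CHUNK = 8192  # assumed buffer size for illustration, not taken from the patch


def compress_stream(src, dst, level=3):
    """Compress file-like src into dst one buffer at a time.

    Returns the number of compressed bytes written.
    """
    comp = zlib.compressobj(level)
    written = 0
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        # compress() may return b"" while zlib is still buffering input
        written += dst.write(comp.compress(chunk))
    # flush() emits whatever zlib is still holding and ends the stream
    written += dst.write(comp.flush())
    return written


def decompress_stream(src, dst):
    """Inverse of compress_stream, also one buffer at a time."""
    decomp = zlib.decompressobj()
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(decomp.decompress(chunk))
    dst.write(decomp.flush())


if __name__ == "__main__":
    raw = b"some highly repetitive tuple data\n" * 50000
    packed = io.BytesIO()
    n = compress_stream(io.BytesIO(raw), packed)
    print(len(raw), "bytes in,", n, "bytes out")
```

Because the stream is one continuous deflate stream, a reader can only
start from the beginning, which is the seeking limitation raised in the
next point.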

> - Each tape is compressed as one long compressed stream. Currently no
> seeking is allowed, so only sorts, no joins! (As Tom said, quick and
> dirty numbers). This should show this possibility in its best light
> but if we want to support seeking we're going to need to change that.
> Maybe no compression on the last pass?

We should be able to do this without significant loss of compression by
redefining the lts (logical tape set) block size to be 32k. That's the
size of zlib's look-back window anyhow, so compressing the whole stream
doesn't get us much more.
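
A rough sketch of that idea, again in Python with the standard zlib module
and not the logtape.c code: each 32k block is compressed on its own, so a
reader can seek to any uncompressed offset by inflating only the blocks it
needs. The names compress_blocks and read_at are hypothetical, and keeping
the compressed blocks in a list stands in for whatever on-disk block index
a real implementation would need.

```python
import zlib

BLOCK = 32 * 1024  # proposed block size, equal to deflate's maximum window


def compress_blocks(data, level=3):
    """Compress data as independent BLOCK-sized pieces."""
    return [
        zlib.compress(data[off:off + BLOCK], level)
        for off in range(0, len(data), BLOCK)
    ]


def read_at(blocks, pos, length):
    """Return length bytes starting at uncompressed offset pos.

    Only the blocks overlapping the requested range are inflated.
    """
    if length <= 0:
        return b""
    first = pos // BLOCK
    last = min((pos + length - 1) // BLOCK, len(blocks) - 1)
    out = bytearray()
    for i in range(first, last + 1):
        out += zlib.decompress(blocks[i])
    start = pos - first * BLOCK
    return bytes(out[start:start + length])


if __name__ == "__main__":
    raw = b"".join(b"row %08d\n" % i for i in range(100000))
    blocks = compress_blocks(raw)
    whole = zlib.compress(raw, 3)
    # Compare per-block total against one long stream for this input
    print("blocks:", sum(len(b) for b in blocks), "whole:", len(whole))
```

How much compression is actually lost depends on the data, so the
comparison printed here says nothing about real sort tapes; it would have
to be measured on them.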

> - It's probable that the benefits are strongly correlated with the speed
> of your disk subsystem. We need to measure this effect. I can't
> accurately measure this because my compiler doesn't inline any of the
> functions in tuplesort.c.

Please make sure any tests have trace_sort = on.

> In my test of a compression ratio around 100-to-1, on 160MB of data
> with tiny work_mem on my 5 year old laptop, it speeds it up by 60% so
> it's obviously not a complete waste of time. Of course, YMMV :)

Sounds good. Well done.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
