Re: Zedstore - compressed in-core columnar storage

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Alexandra Wang <lewang(at)pivotal(dot)io>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-08-20 11:12:32
Message-ID: ba7fcda3-9b7a-3ccf-d486-bd02070d482f@iki.fi
Lists: pgsql-hackers

On 20/08/2019 05:04, Justin Pryzby wrote:
>>> it looks like zedstore with lz4 gets ~4.6x for our largest customer's
>>> largest table. zfs using compress=gzip-1 gives 6x compression across
>>> all their partitioned tables, and I'm surprised it beats zedstore.

I did a quick test with 10 million random IP addresses, in text format.
I loaded them into a zedstore table ("create table ips (ip text) using
zedstore"), and poked around a little bit to see how the space is used.

postgres=# select lokey, nitems, ncompressed, totalsz, uncompressedsz,
freespace from pg_zs_btree_pages('ips') where attno=1 and level=0 limit 10;
 lokey | nitems | ncompressed | totalsz | uncompressedsz | freespace
-------+--------+-------------+---------+----------------+-----------
     1 |      4 |           4 |    6785 |           7885 |      1320
   537 |      5 |           5 |    7608 |           8818 |       492
  1136 |      4 |           4 |    6762 |           7888 |      1344
  1673 |      5 |           5 |    7548 |           8776 |       540
  2269 |      4 |           4 |    6841 |           7895 |      1256
  2807 |      5 |           5 |    7555 |           8784 |       540
  3405 |      5 |           5 |    7567 |           8772 |       524
  4001 |      4 |           4 |    6791 |           7899 |      1320
  4538 |      5 |           5 |    7596 |           8776 |       500
  5136 |      4 |           4 |    6750 |           7875 |      1360
(10 rows)
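
The summary figures quoted below (about 10% free space per page, and
about 125 datums per item) can be derived from the same function,
roughly like this. This sketch assumes the standard 8 kB block size,
and that hikey is exposed by pg_zs_btree_pages() alongside lokey:

-- Average free-space fraction and datums-per-item over the leaf pages.
select avg(freespace) / 8192.0                  as avg_free_fraction,
       avg((hikey - lokey)::numeric / nitems)   as avg_datums_per_item
from pg_zs_btree_pages('ips')
where attno = 1 and level = 0;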

There's on average about 10% of free space on the pages, so we're losing
quite a bit of ground to ZFS compression right there. I'm sure there's
some free space on heap pages as well, but ZFS compression will squeeze
it out.

The compression ratio is indeed not very good. I think one reason is
that zedstore does LZ4 compression in relatively small chunks, while ZFS
surely compresses large blocks in one go. Looking at the above, there
are on average about 125 datums packed into each "item"
(avg(hikey-lokey) / nitems).
I did a quick test with the "lz4" command-line utility, compressing flat
files containing random IP addresses.

$ lz4 /tmp/125-ips.txt
Compressed filename will be : /tmp/125-ips.txt.lz4
Compressed 1808 bytes into 1519 bytes ==> 84.02%

$ lz4 /tmp/550-ips.txt
Compressed filename will be : /tmp/550-ips.txt.lz4
Compressed 7863 bytes into 6020 bytes ==> 76.56%

$ lz4 /tmp/750-ips.txt
Compressed filename will be : /tmp/750-ips.txt.lz4
Compressed 10646 bytes into 8035 bytes ==> 75.47%

The first case is roughly what zedstore does currently: we compress
about 125 datums as one chunk. The second case is roughly what we would
get if we collected about 8 kB worth of datums and compressed them all
as one chunk. And the third case simulates allowing the input to be
larger than 8 kB, so that the compressed chunk just fits on an 8 kB
page. There's not much difference between the second and third cases,
but it's pretty clear that we're being hurt by splitting the input into
such small chunks.

The downside of using a larger compression chunk size is that random
access becomes more expensive. Need to give the on-disk format some more
thought. That said, I actually don't feel too bad about the current
compression ratio; perfect can be the enemy of good.

- Heikki
