Re: Zedstore - compressed in-core columnar storage

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Soumyadeep Chakraborty <soumyadeep2007(at)gmail(dot)com>, Alexandra Wang <lewang(at)pivotal(dot)io>
Cc: Taylor Vesely <tvesely(at)pivotal(dot)io>, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, DEV_OPS <devops(at)ww-it(dot)cn>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2020-11-12 22:40:30
Message-ID: 9e1077bc-c5c1-425b-07ad-424d3522363f@enterprisedb.com
Lists: pgsql-hackers

Hi,

Thanks for the updated patch. It's quite a massive amount of code - I
don't think we've had many 2MB patches in the past, so this is by no
means a full review.

1) the psql_1.out is missing a bit of expected output (due to 098fb0079)

2) I'm getting crashes in intarray contrib, due to hitting this error in
lwlock.c (backtrace attached):

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
elog(ERROR, "too many LWLocks taken");

I haven't investigated this too much, but it's a regular build with
asserts and TAP tests, so it should be simple to reproduce using "make
check-world" I guess.

3) I did a very simple benchmark, loading TPC-H data (about 75GB),
followed by pg_dump, and the durations (in seconds) look like this:

          master   zedstore/pglz   zedstore/lz4
  -----------------------------------------------
  copy      1855           68092           2131
  dump       751             905            811

And the size of the lineitem table (as shown by \d+) is:

  master:         64GB
  zedstore/pglz:  51GB
  zedstore/lz4:   20GB

It's mostly expected that lz4 beats pglz both in performance and
compression ratio, but this seems a bit too extreme, I guess. Per past
benchmarks (e.g. [1] and [2]) the difference in compression/decompression
time should be maybe 1-2x or something like that, not 35x like here.

[1]
https://www.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de

[2]
https://www.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de

Furthermore, pglz compression does not seem to be what's consuming most
of the CPU - at least that's what perf says:

  24.82%  postgres      [.] encode_chunk_varlen
  20.49%  postgres      [.] decode_chunk
  13.01%  postgres      [.] merge_attstream_guts.isra.0
  12.68%  libc-2.32.so  [.] __memmove_avx_unaligned_erms
   8.72%  postgres      [.] encode_chunk_fixed
   6.16%  postgres      [.] pglz_compress
   4.36%  postgres      [.] decode_attstream_cont
   2.27%  postgres      [.] 0x00000000000baff0
   1.84%  postgres      [.] AllocSetAlloc
   0.79%  postgres      [.] append_attstream
   0.70%  postgres      [.] palloc

So I wonder if this is a sign of a deeper issue - maybe the lower
compression ratio (for pglz) triggers some sort of feedback loop in
zedstore, or something like that? Not sure, but this seems strange.

4) I looked at some of the code, like merge_attstream etc. and I wonder
if this might be related to some of the FIXME comments. For example this
bit in merge_attstream seems interesting:

* FIXME: we don't actually pay attention to the compression anymore.
* We never repack.
* FIXME: this is backwords, the normal fast path is if (firsttid1 > lasttid2)

But I suppose that should affect both pglz and lz4, and I'm not sure how
up to date those comments actually are.

BTW the comments in general need updating and tidying up, to make
reviews easier. For example the merge_attstream comment references
attstream1 and attstream2, but those are not the current parameters of
the function.

5) IMHO there should be a #define specifying the maximum number of items
per chunk (60). Currently there are literal constants used in various
places, sometimes 60, sometimes 59, etc., which makes the code harder to
understand. FWIW 60 seems a bit low, but maybe it's OK.
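
Something along these lines would do (the name is just a suggestion, and
the "nitems" check is only meant as an illustration, not actual code from
the patch):

    /* maximum number of items stored in a single attstream chunk */
    #define ZS_MAX_ITEMS_PER_CHUNK    60

    ...
    if (nitems >= ZS_MAX_ITEMS_PER_CHUNK)
        break;        /* chunk is full, start a new one */

instead of the bare 60 / 59 literals scattered around.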

6) I do think ZSAttStream should track which compression is used by the
stream, for two main reasons. Firstly, there's another patch to support
"custom compression" methods, which (also) allows multiple compression
methods per column. It'd be a bit strange to support that for varlena
columns in heap tables but not here, I guess. Secondly, I think one of
the interesting columnstore features down the road will be execution on
compressed data, which however requires compression methods designed for
that purpose, and those are often datatype-specific (delta encoding, ...).

I don't think we need to go as far as supporting "custom" compression
methods here, but I think we should allow different built-in compression
methods for different attstreams.
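
In practice I guess that means adding a small field to ZSAttStream
identifying the compression method, something along these lines (the
names are just a sketch, not a proposal for the actual on-disk format):

    /* which compression method an attstream uses (sketch only) */
    typedef enum ZSCompressionMethod
    {
        ZS_COMPRESSION_NONE,
        ZS_COMPRESSION_PGLZ,
        ZS_COMPRESSION_LZ4
    } ZSCompressionMethod;

    /* ... plus e.g. a uint8 field in ZSAttStream storing that value */

and then merge_attstream etc. could check it and recompress when
combining streams compressed by different methods.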

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment           Content-Type   Size
crash-zedstore.txt   text/plain     4.7 KB
perf.txt             text/plain     8.8 KB
