Re: Zedstore - compressed in-core columnar storage

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Alexandra Wang <lewang(at)pivotal(dot)io>
Cc: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, DEV_OPS <devops(at)ww-it(dot)cn>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-08-29 12:09:33
Message-ID: ed9dfcfb-871f-f6e6-6463-4ab47b4cb273@iki.fi
Lists: pgsql-hackers

On 29/08/2019 14:30, Ashutosh Sharma wrote:
>
> On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lewang(at)pivotal(dot)io
> <mailto:lewang(at)pivotal(dot)io>> wrote:
>
> You are correct that we currently go through each item in the leaf page
> that contains the given tid. Specifically, the logic to retrieve all the
> attribute items inside a ZSAttStream is now moved to decode_attstream()
> in the latest code, and then in zsbt_attr_fetch() we again loop through
> each item we previously retrieved from decode_attstream() and look for
> the given tid.
>
>
> Okay. Any idea why this new way of storing attribute data as streams
> (lowerstream and upperstream) has been chosen just for the attributes
> but not for tids? Are only attribute blocks compressed but not the tid
> blocks?

Right, only attribute blocks are currently compressed. Tid blocks need
to be modified when there are UPDATEs or DELETEs, so I think having to
decompress and recompress them would be more costly. Also, there is no
user data in the TID tree, and the Simple-8b encoded codewords used to
represent the TIDs are already pretty compact. I'm not sure how much
gain you would get from passing them through a general-purpose compressor.

I could be wrong though. We could certainly try it out, and see how it
performs.
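
To illustrate why delta-encoded TIDs packed into Simple-8b style codewords
are already compact, here is a minimal sketch in Python. It is not
Zedstore's actual on-disk format: the mode table follows the classic
Simple-8b layout (a 4-bit selector plus 60 payload bits) without the
run-length selectors, and the names MODES, pack() and unpack() are made up
for this example.

```python
# Illustrative only: a simplified Simple-8b style packer, not Zedstore's code.
# Each 64-bit word holds a 4-bit selector and n integers of `bits` bits each.
MODES = [(60, 1), (30, 2), (20, 3), (15, 4), (12, 5), (10, 6), (8, 7),
         (7, 8), (6, 10), (5, 12), (4, 15), (3, 20), (2, 30), (1, 60)]


def pack(values):
    """Greedily pack non-negative integers (< 2**60) into 64-bit words."""
    words = []
    i = 0
    while i < len(values):
        for selector, (n, bits) in enumerate(MODES):
            chunk = values[i:i + n]
            # Use the densest mode that is completely filled and fits.
            if len(chunk) == n and all(0 <= v < (1 << bits) for v in chunk):
                word = selector
                for k, v in enumerate(chunk):
                    word |= v << (4 + k * bits)
                words.append(word)
                i += n
                break
        else:
            raise ValueError("value does not fit in 60 bits")
    return words


def unpack(words):
    """Inverse of pack(): read the selector, then extract each integer."""
    out = []
    for word in words:
        n, bits = MODES[word & 0xF]
        mask = (1 << bits) - 1
        for k in range(n):
            out.append((word >> (4 + k * bits)) & mask)
    return out


def tids_to_deltas(tids):
    """Store the first TID as-is and every later TID as a gap."""
    return [tids[0]] + [b - a for a, b in zip(tids, tids[1:])]


def deltas_to_tids(deltas):
    tids, total = [], 0
    for d in deltas:
        total += d
        tids.append(total)
    return tids


tids = list(range(1000, 1060))          # 60 consecutive TIDs
words = pack(tids_to_deltas(tids))
print(len(tids), "tids packed into", len(words), "64-bit words")
```

With dense TIDs most deltas are 1, so many of them share one codeword. That
leaves little redundancy for a general-purpose compressor to remove, but
that is only an expectation until it is measured.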

> One optimization we can do is to tell decode_attstream() to stop
> decoding at the tid we are interested in. We can also apply other
> tricks to speed up the lookups in the page: for fixed-length
> attributes, it is easy to do binary search instead of linear search,
> and for variable-length attributes, we can probably try something that
> we haven't thought of yet.
>
>
> I think we can probably ask decode_attstream() to stop once it has found
> the tid that we are searching for but then we only need to do that for
> Index Scans.

I've been thinking that we should add a few "bookmarks" on long streams,
so that you could skip e.g. to the midpoint in a stream. It's a tradeoff
though; when you add more information for random access, it makes the
representation less compact.
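
A rough sketch of the bookmark idea, in Python for brevity. Everything here
is hypothetical (the class name BookmarkedStream, the `every` parameter, and
the layout of plain lists instead of an encoded stream); it only shows the
tradeoff: each bookmark costs extra space, but a lookup can jump to the
nearest bookmark and stop decoding as soon as it reaches the target tid.

```python
from bisect import bisect_right


class BookmarkedStream:
    """Toy model of a delta-encoded (tid, value) stream with bookmarks."""

    def __init__(self, items, every=64):
        # items: list of (tid, value) pairs sorted by tid.
        self.deltas = []          # tid gaps, as a stand-in for encoded TIDs
        self.values = []
        self.bookmark_tids = []   # absolute tid at each bookmark
        self.bookmark_idx = []    # item index that the bookmark points at
        prev = 0
        for idx, (tid, value) in enumerate(items):
            if idx % every == 0:
                self.bookmark_tids.append(tid)
                self.bookmark_idx.append(idx)
            self.deltas.append(tid - prev)
            self.values.append(value)
            prev = tid

    def fetch(self, target):
        """Return (value or None, number of items decoded)."""
        # Find the last bookmark whose tid is <= target.
        pos = bisect_right(self.bookmark_tids, target) - 1
        if pos < 0:
            return None, 0
        tid = self.bookmark_tids[pos]
        idx = self.bookmark_idx[pos]
        steps = 0
        while True:
            steps += 1
            if tid == target:
                return self.values[idx], steps   # found: stop decoding
            if tid > target:
                return None, steps               # passed it: not present
            idx += 1
            if idx >= len(self.deltas):
                return None, steps               # end of stream
            tid += self.deltas[idx]


stream = BookmarkedStream([(t, t * 10) for t in range(1, 1001)], every=100)
value, steps = stream.fetch(750)
print(value, steps)
```

With `every=100`, a lookup decodes at most 100 items instead of up to 1000,
at the cost of ten extra (tid, index) pairs. Choosing the spacing is exactly
the compactness versus random-access tradeoff.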

> Zedstore currently implements update as delete+insert, hence the old
> tid is not reused. We don't store the tuple in our UNDO log, and we
> only store the transaction information in the UNDO log. Reusing the tid
> of the old tuple means putting the old tuple in the UNDO log, which we
> have not implemented yet.
>
> Okay, so that means performing an update on a non-key attribute would
> also require changes in the index table. In short, HOT update is
> currently not possible with a zedstore table. Am I right?

That's right. There's a lot of potential gain from doing HOT updates. For
example, if you UPDATE one column in every row of a table, ideally you
would only modify the attribute tree containing that column. But that
hasn't been implemented.

- Heikki
