Quick Links

Re: Zedstore - compressed in-core columnar storage

From:	DEV_OPS <devops(at)ww-it(dot)cn>
To:	Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Zedstore - compressed in-core columnar storage
Date:	2019-05-30 15:07:30
Message-ID:	99fefcc5-8320-a5d1-5d6d-cbe339a2f151@ww-it.cn
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

it's really cool and very good progress,

I'm interesting if SIDM/JIT will be supported

best wishes

On 2019/5/23 08:07, Ashwin Agrawal wrote:
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
> share the recent additions and modifications. Attaching a patch
> with the latest code. Link to github branch [1] to follow
> along. The approach we have been leaning towards is to build required
> functionality, get passing the test and then continue to iterate to
> optimize the same. It's still work-in-progress.
>
> Sharing the details now, as have reached our next milestone for
> Zedstore. All table AM API's are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, seems need test for it first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost
> compression and performance. Based on Konstantin's performance
> experiments [2] and inputs from Tomas Vodra [3], this is
> added. Array item holds multiple datums, with consecutive TIDs and
> the same visibility information. An array item saves space compared
> to multiple single items, by leaving out repetitive UNDO and TID
> fields. An array item cannot mix NULLs and non-NULLs. So, those
> experiments should result in improved performance now. Inserting
> data via COPY creates array items currently. Code for insert has not
> been modified from last time. Making singleton inserts or insert
> into select, performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside
> rest of the data column btrees. This special or first btree for
> meta-column is used to assign TIDs for tuples, track the UNDO
> location which provides visibility information. Also, this special
> btree, which always exists, helps to support zero-column tables
> (which can be a result of ADD COLUMN DROP COLUMN actions as
> well). Plus, having meta-data stored separately from data, helps to
> get better compression ratios. And also helps to further simplify
> the overall design/implementation as for deletes just need to edit
> the meta-column and avoid touching the actual data btrees. Index
> scans can just perform visibility checks based on this meta-column
> and fetch required datums only for visible tuples. For tuple locks
> also just need to access this meta-column only. Previously, every
> column btree used to carry the same undo pointer. Thus visibility
> check could be potentially performed, with the past layout, using
> any column. But considering overall simplification new layout
> provides it's fine to give up on that aspect. Having dedicated
> meta-column highly simplified handling for add columns with default
> and null values, as this column deterministically provides all the
> TIDs present in the table, which can't be said for any other data
> columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
> pages in the relation. The FPM is also a b-tree, indexed by physical
> block number. To be more compact, it stores "extents", i.e. block
> ranges, rather than just blocks, when possible. An interesting paper [4]
> on
> how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
> - 31 out of 194 failing regress tests
> - 10 out of 86 failing isolation tests
> Many of the current failing tests are due to plan differences, like
> Index scans selected for zedstore over IndexOnly scans, as zedstore
> doesn't yet have visibility map. I am yet to give a thought on
> index-only scans. Or plan diffs due to table size differences between
> heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Zedstore crash safe (means also replication safe). Implement WAL
> logs
> - Performance profiling and optimizations for Insert, Selects, Index
> Scans, etc...
> - Once UNDO framework lands in Upstream, Zedstore leverages it instead
> of its own version of UNDO
>
> Open questions / discussion items:
>
> - how best to get "column projection list" from planner? (currently,
> we walk plan and find the columns required for the query in
> the executor, refer GetNeededColumnsForNode())
>
> - how to pass the "column projection list" to table AM? (as stated in
> initial email, currently we have modified table am API to pass the
> projection to AM)
>
> - TID treated as (block, offset) in current indexing code
>
> - Physical tlist optimization? (currently, we disabled it for
> zedstore)
>
> Team:
> Melanie joined Heikki and me to write code for zedstore. Majority of
> the code continues to be contributed by Heikki. We are continuing to
> have fun building column store implementation and iterate
> aggressively.
>
> References:
> 1] https://github.com/greenplum-db/postgres/tree/zedstore
> 2]
> https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> 3]
> https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> 4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
>

In response to

Re: Zedstore - compressed in-core columnar storage at 2019-05-23 00:07:45 from Ashwin Agrawal

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tomas Vondra	2019-05-30 15:10:29	Re: error messages in extended statistics
Previous Message	Alvaro Herrera	2019-05-30 14:22:15	Re: coverage increase for worker_spi