Quick Links

Re: Zedstore - compressed in-core columnar storage

From:	Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
To:	PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Zedstore - compressed in-core columnar storage
Date:	2019-05-23 00:07:45
Message-ID:	CALfoeiuuLe12PuQ=zvM_L7B5qegBqGHYENHUGbLOsjAnG=mi4A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
share the recent additions and modifications. Attaching a patch
with the latest code. Link to github branch [1] to follow
along. The approach we have been leaning towards is to build required
functionality, get passing the test and then continue to iterate to
optimize the same. It's still work-in-progress.

Sharing the details now, as have reached our next milestone for
Zedstore. All table AM API's are implemented for Zedstore (except
compute_xid_horizon_for_tuples, seems need test for it first).

Current State:

- A new type of item added to Zedstore "Array item", to boost
compression and performance. Based on Konstantin's performance
experiments [2] and inputs from Tomas Vodra [3], this is
added. Array item holds multiple datums, with consecutive TIDs and
the same visibility information. An array item saves space compared
to multiple single items, by leaving out repetitive UNDO and TID
fields. An array item cannot mix NULLs and non-NULLs. So, those
experiments should result in improved performance now. Inserting
data via COPY creates array items currently. Code for insert has not
been modified from last time. Making singleton inserts or insert
into select, performant is still on the todo list.

- Now we have a separate and dedicated meta-column btree alongside
rest of the data column btrees. This special or first btree for
meta-column is used to assign TIDs for tuples, track the UNDO
location which provides visibility information. Also, this special
btree, which always exists, helps to support zero-column tables
(which can be a result of ADD COLUMN DROP COLUMN actions as
well). Plus, having meta-data stored separately from data, helps to
get better compression ratios. And also helps to further simplify
the overall design/implementation as for deletes just need to edit
the meta-column and avoid touching the actual data btrees. Index
scans can just perform visibility checks based on this meta-column
and fetch required datums only for visible tuples. For tuple locks
also just need to access this meta-column only. Previously, every
column btree used to carry the same undo pointer. Thus visibility
check could be potentially performed, with the past layout, using
any column. But considering overall simplification new layout
provides it's fine to give up on that aspect. Having dedicated
meta-column highly simplified handling for add columns with default
and null values, as this column deterministically provides all the
TIDs present in the table, which can't be said for any other data
columns due to default or null values during add column.

- Free Page Map implemented. The Free Page Map keeps track of unused
pages in the relation. The FPM is also a b-tree, indexed by physical
block number. To be more compact, it stores "extents", i.e. block
ranges, rather than just blocks, when possible. An interesting paper [4]
on
how modern filesystems manage space acted as a good source for ideas.

- Tuple locks implemented

- Serializable isolation handled

- With "default_table_access_method=zedstore"
- 31 out of 194 failing regress tests
- 10 out of 86 failing isolation tests
Many of the current failing tests are due to plan differences, like
Index scans selected for zedstore over IndexOnly scans, as zedstore
doesn't yet have visibility map. I am yet to give a thought on
index-only scans. Or plan diffs due to table size differences between
heap and zedstore.

Next few milestones we wish to hit for Zedstore:
- Make check regress green
- Make check isolation green
- Zedstore crash safe (means also replication safe). Implement WAL
logs
- Performance profiling and optimizations for Insert, Selects, Index
Scans, etc...
- Once UNDO framework lands in Upstream, Zedstore leverages it instead
of its own version of UNDO

Open questions / discussion items:

- how best to get "column projection list" from planner? (currently,
we walk plan and find the columns required for the query in
the executor, refer GetNeededColumnsForNode())

- how to pass the "column projection list" to table AM? (as stated in
initial email, currently we have modified table am API to pass the
projection to AM)

- TID treated as (block, offset) in current indexing code

- Physical tlist optimization? (currently, we disabled it for
zedstore)

Team:
Melanie joined Heikki and me to write code for zedstore. Majority of
the code continues to be contributed by Heikki. We are continuing to
have fun building column store implementation and iterate
aggressively.

References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2]
https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
3]
https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf

Attachment	Content-Type	Size
v2-0001-Zedstore-compressed-in-core-columnar-storage.patch	text/x-patch	786.3 KB

In response to

Re: Zedstore - compressed in-core columnar storage at 2019-04-25 06:44:45 from Heikki Linnakangas

Responses

Re: Zedstore - compressed in-core columnar storage at 2019-05-24 02:30:19 from Ajin Cherian
Re: Zedstore - compressed in-core columnar storage at 2019-05-25 04:48:24 from Mark Kirkwood
Re: Zedstore - compressed in-core columnar storage at 2019-05-30 15:07:30 from DEV_OPS
Re: Zedstore - compressed in-core columnar storage at 2019-07-01 19:14:37 from Ashwin Agrawal

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2019-05-23 00:11:33	Re: pg_dump throwing "column number -1 is out of range 0..36" on HEAD
Previous Message	Tom Lane	2019-05-22 22:57:53	Suppressing noise in successful check-world runs