From: | Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Zedstore - compressed in-core columnar storage |
Date: | 2019-05-25 04:48:24 |
Message-ID: | 101f8490-a7bc-230e-cb38-730e26ca81bd@catalyst.net.nz |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 23/05/19 12:07 PM, Ashwin Agrawal wrote:
>
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
> share the recent additions and modifications. Attaching a patch
> with the latest code. Link to github branch [1] to follow
> along. The approach we have been leaning towards is to build required
> functionality, get passing the test and then continue to iterate to
> optimize the same. It's still work-in-progress.
>
> Sharing the details now, as have reached our next milestone for
> Zedstore. All table AM API's are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, seems need test for it first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost
> compression and performance. Based on Konstantin's performance
> experiments [2] and inputs from Tomas Vodra [3], this is
> added. Array item holds multiple datums, with consecutive TIDs and
> the same visibility information. An array item saves space compared
> to multiple single items, by leaving out repetitive UNDO and TID
> fields. An array item cannot mix NULLs and non-NULLs. So, those
> experiments should result in improved performance now. Inserting
> data via COPY creates array items currently. Code for insert has not
> been modified from last time. Making singleton inserts or insert
> into select, performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside
> rest of the data column btrees. This special or first btree for
> meta-column is used to assign TIDs for tuples, track the UNDO
> location which provides visibility information. Also, this special
> btree, which always exists, helps to support zero-column tables
> (which can be a result of ADD COLUMN DROP COLUMN actions as
> well). Plus, having meta-data stored separately from data, helps to
> get better compression ratios. And also helps to further simplify
> the overall design/implementation as for deletes just need to edit
> the meta-column and avoid touching the actual data btrees. Index
> scans can just perform visibility checks based on this meta-column
> and fetch required datums only for visible tuples. For tuple locks
> also just need to access this meta-column only. Previously, every
> column btree used to carry the same undo pointer. Thus visibility
> check could be potentially performed, with the past layout, using
> any column. But considering overall simplification new layout
> provides it's fine to give up on that aspect. Having dedicated
> meta-column highly simplified handling for add columns with default
> and null values, as this column deterministically provides all the
> TIDs present in the table, which can't be said for any other data
> columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
> pages in the relation. The FPM is also a b-tree, indexed by physical
> block number. To be more compact, it stores "extents", i.e. block
> ranges, rather than just blocks, when possible. An interesting paper
> [4] on
> how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
> - 31 out of 194 failing regress tests
> - 10 out of 86 failing isolation tests
> Many of the current failing tests are due to plan differences, like
> Index scans selected for zedstore over IndexOnly scans, as zedstore
> doesn't yet have visibility map. I am yet to give a thought on
> index-only scans. Or plan diffs due to table size differences between
> heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Zedstore crash safe (means also replication safe). Implement WAL
> logs
> - Performance profiling and optimizations for Insert, Selects, Index
> Scans, etc...
> - Once UNDO framework lands in Upstream, Zedstore leverages it instead
> of its own version of UNDO
>
> Open questions / discussion items:
>
> - how best to get "column projection list" from planner? (currently,
> we walk plan and find the columns required for the query in
> the executor, refer GetNeededColumnsForNode())
>
> - how to pass the "column projection list" to table AM? (as stated in
> initial email, currently we have modified table am API to pass the
> projection to AM)
>
> - TID treated as (block, offset) in current indexing code
>
> - Physical tlist optimization? (currently, we disabled it for
> zedstore)
>
> Team:
> Melanie joined Heikki and me to write code for zedstore. Majority of
> the code continues to be contributed by Heikki. We are continuing to
> have fun building column store implementation and iterate
> aggressively.
>
> References:
> 1] https://github.com/greenplum-db/postgres/tree/zedstore
> 2]
> https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> 3]
> https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> 4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
>
FWIW - building this against latest 12 beta1:
Loading and examining the standard pgbench schema (with the old names,
sorry) in v10 (standard heap_ and v12 (zedstore)
v10:
bench=# \i load.sql
COPY 100
Time: 16.335 ms
COPY 1000
Time: 16.748 ms
COPY 10000000
Time: 50276.230 ms (00:50.276)
bench=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+----------+-------+----------+------------+-------------
public | accounts | table | postgres | 1281 MB |
public | branches | table | postgres | 8192 bytes |
public | history | table | postgres | 0 bytes |
public | tellers | table | postgres | 72 kB |
v12+zedstore:
bench=# \i load.sql
COPY 100
Time: 0.656 ms
COPY 1000
Time: 3.573 ms
COPY 10000000
Time: 26244.832 ms (00:26.245)
bench=# \dt+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+----------+-------+----------+---------+-------------
public | accounts | table | postgres | 264 MB |
public | branches | table | postgres | 56 kB |
public | history | table | postgres | 0 bytes |
public | tellers | table | postgres | 64 kB |
So a good improvement in load times and on disk footprint! Also note
that I did not build with lz4 so looks like you guys have fixed the
quirks with compression making things bigger.
regards
Mark
From | Date | Subject | |
---|---|---|---|
Next Message | David Rowley | 2019-05-25 06:55:11 | Re: Runtime pruning problem |
Previous Message | David Rowley | 2019-05-25 04:37:54 | Re: Performance issue in foreign-key-aware join estimation |