Re: Zedstore - compressed in-core columnar storage

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-05-25 04:48:24
Message-ID: 101f8490-a7bc-230e-cb38-730e26ca81bd@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 23/05/19 12:07 PM, Ashwin Agrawal wrote:
>
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
> share the recent additions and modifications. Attaching a patch
> with the latest code. Link to github branch [1] to follow
> along. The approach we have been leaning towards is to build required
> functionality, get passing the test and then continue to iterate to
> optimize the same. It's still work-in-progress.
>
> Sharing the details now, as have reached our next milestone for
> Zedstore. All table AM API's are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, seems need test for it first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost
>   compression and performance. Based on Konstantin's performance
>   experiments [2] and inputs from Tomas Vodra [3], this is
>   added. Array item holds multiple datums, with consecutive TIDs and
>   the same visibility information. An array item saves space compared
>   to multiple single items, by leaving out repetitive UNDO and TID
>   fields. An array item cannot mix NULLs and non-NULLs. So, those
>   experiments should result in improved performance now. Inserting
>   data via COPY creates array items currently. Code for insert has not
>   been modified from last time. Making singleton inserts or insert
>   into select, performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside
>   rest of the data column btrees. This special or first btree for
>   meta-column is used to assign TIDs for tuples, track the UNDO
>   location which provides visibility information. Also, this special
>   btree, which always exists, helps to support zero-column tables
>   (which can be a result of ADD COLUMN DROP COLUMN actions as
>   well). Plus, having meta-data stored separately from data, helps to
>   get better compression ratios. And also helps to further simplify
>   the overall design/implementation as for deletes just need to edit
>   the meta-column and avoid touching the actual data btrees. Index
>   scans can just perform visibility checks based on this meta-column
>   and fetch required datums only for visible tuples. For tuple locks
>   also just need to access this meta-column only. Previously, every
>   column btree used to carry the same undo pointer. Thus visibility
>   check could be potentially performed, with the past layout, using
>   any column. But considering overall simplification new layout
>   provides it's fine to give up on that aspect. Having dedicated
>   meta-column highly simplified handling for add columns with default
>   and null values, as this column deterministically provides all the
>   TIDs present in the table, which can't be said for any other data
>   columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
>   pages in the relation. The FPM is also a b-tree, indexed by physical
>   block number. To be more compact, it stores "extents", i.e. block
>   ranges, rather than just blocks, when possible. An interesting paper
> [4] on
>   how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
>   - 31 out of 194 failing regress tests
>   - 10 out of 86 failing isolation tests
> Many of the current failing tests are due to plan differences, like
> Index scans selected for zedstore over IndexOnly scans, as zedstore
> doesn't yet have visibility map. I am yet to give a thought on
> index-only scans. Or plan diffs due to table size differences between
> heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Zedstore crash safe (means also replication safe). Implement WAL
>   logs
> - Performance profiling and optimizations for Insert, Selects, Index
>   Scans, etc...
> - Once UNDO framework lands in Upstream, Zedstore leverages it instead
>   of its own version of UNDO
>
> Open questions / discussion items:
>
> - how best to get "column projection list" from planner? (currently,
>   we walk plan and find the columns required for the query in
>   the executor, refer GetNeededColumnsForNode())
>
> - how to pass the "column projection list" to table AM? (as stated in
>   initial email, currently we have modified table am API to pass the
>   projection to AM)
>
> - TID treated as (block, offset) in current indexing code
>
> - Physical tlist optimization? (currently, we disabled it for
>   zedstore)
>
> Team:
> Melanie joined Heikki and me to write code for zedstore. Majority of
> the code continues to be contributed by Heikki. We are continuing to
> have fun building column store implementation and iterate
> aggressively.
>
> References:
> 1] https://github.com/greenplum-db/postgres/tree/zedstore
> 2]
> https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> 3]
> https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> 4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
>

FWIW - building this against latest 12 beta1:

Loading and examining the standard pgbench schema (with the old names,
sorry) in v10 (standard heap_ and v12 (zedstore)

v10:

bench=# \i load.sql
COPY 100
Time: 16.335 ms
COPY 1000
Time: 16.748 ms
COPY 10000000
Time: 50276.230 ms (00:50.276)
bench=# \dt+
                        List of relations
 Schema |   Name   | Type  |  Owner   |    Size    | Description
--------+----------+-------+----------+------------+-------------
 public | accounts | table | postgres | 1281 MB    |
 public | branches | table | postgres | 8192 bytes |
 public | history  | table | postgres | 0 bytes    |
 public | tellers  | table | postgres | 72 kB      |

v12+zedstore:

bench=# \i load.sql
COPY 100
Time: 0.656 ms
COPY 1000
Time: 3.573 ms
COPY 10000000
Time: 26244.832 ms (00:26.245)
bench=# \dt+
                      List of relations
 Schema |   Name   | Type  |  Owner   |  Size   | Description
--------+----------+-------+----------+---------+-------------
 public | accounts | table | postgres | 264 MB  |
 public | branches | table | postgres | 56 kB   |
 public | history  | table | postgres | 0 bytes |
 public | tellers  | table | postgres | 64 kB   |

So a good improvement in load times and on disk footprint! Also note
that I did not build with lz4 so looks like you guys have fixed the
quirks with compression making things bigger.

regards

Mark

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-05-25 06:55:11 Re: Runtime pruning problem
Previous Message David Rowley 2019-05-25 04:37:54 Re: Performance issue in foreign-key-aware join estimation