Re: Zedstore - compressed in-core columnar storage

From: Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>
To: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-04-11 13:05:34
Message-ID: CA+FpmFe+FajyweVcz=2HjURWg1wHiyzB5RF-aG35Or1_o112aQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 9 Apr 2019 at 02:27, Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

> Heikki and I have been hacking recently for few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and link to github branch [1] to
> follow along.
>
> The objective is to gather feedback on design and approach to the
> same. The implementation has core basic pieces working but not close
> to complete.
>
> Big thank you to Andres, Haribabu, and team for the table access method
> APIs. We leveraged the APIs to implement zedstore, and they proved to
> be in very good shape. We had to enhance them only minimally and, in
> general, didn't have to touch the executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting subset of columns
> (reduced IO).
> * Reduced on-disk footprint compared to heap tables, via shorter tuple
> headers and compression of similar-typed data.
> * Be a first-class citizen in the Postgres architecture (table data can
> live independently in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and
> others separately. Provide flexibility of granularity on how to
> divide the columns. Columns accessed together can be stored
> together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding
> full rewrite of the table.
>
> High-level Design - B-trees for the win!
> ========================================
>
> To start simple, let's ignore the column store aspect for a moment and
> consider it as a compressed row store. The column store is a natural
> extension of this concept, explained in the next section.
>
> The basic on-disk data structure is a B-tree, indexed by TID. A B-tree
> is a great data structure: fast and versatile. Note this is not
> referring to the existing B-tree indexes, but to a net-new, separate
> B-tree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. A logical TID provides
> the flexibility to move tuples between pages when page splits or page
> merges are performed.
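>
> As a tiny, hypothetical illustration (not the patch's actual
> definitions), a 48-bit logical TID can be carried in a plain 64-bit
> integer, with no block/offset arithmetic anywhere:
>
>   #include <stdint.h>
>   #include <inttypes.h>
>   #include <stdio.h>
>
>   typedef uint64_t zs_tid;    /* only the low 48 bits are ever used */
>
>   #define ZS_TID_BITS 48
>   #define ZS_MAX_TID  ((UINT64_C(1) << ZS_TID_BITS) - 1)
>
>   int
>   main(void)
>   {
>       zs_tid tid = UINT64_C(123456789);
>
>       /* A TID is resolved by descending the B-tree, never by block math. */
>       printf("tid = %" PRIu64 ", max tid = %" PRIu64 "\n", tid, ZS_MAX_TID);
>       return 0;
>   }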
>
> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have a short uncompressed header,
> followed by B-tree items. Two kinds of items exist:
>
> - plain item, holds one tuple or one datum, uncompressed payload
> - a "container item", holds multiple plain items, compressed payload
>
> +-----------------------------
> | Fixed-size page header:
> |
> | LSN
> | TID low and hi key (for Lehman & Yao B-tree operations)
> | left and right page pointers
> |
> | Items:
> |
> | TID | size | flags | uncompressed size | lastTID | payload (container item)
> | TID | size | flags | uncompressed size | lastTID | payload (container item)
> | TID | size | flags | undo pointer | payload (plain item)
> | TID | size | flags | undo pointer | payload (plain item)
> | ...
> |
> +----------------------------
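>
> Purely as an illustration of the layout above (field names and widths
> are guesses for this sketch, not the patch's actual structs), the two
> leaf-item kinds could be described roughly like this in C:
>
>   #include <stdint.h>
>
>   typedef uint64_t zs_tid;        /* 48-bit logical TID, carried in 64 bits */
>
>   /* plain item: one uncompressed tuple/datum, with its own undo pointer */
>   typedef struct
>   {
>       zs_tid    tid;              /* TID of this item */
>       uint16_t  size;             /* total item size on the page */
>       uint16_t  flags;            /* distinguishes plain vs. container, etc. */
>       uint64_t  undo_ptr;         /* pointer into the UNDO log */
>       /* uncompressed payload follows */
>   } zs_plain_item;
>
>   /* container item: many plain items compressed into one payload */
>   typedef struct
>   {
>       zs_tid    firsttid;         /* first TID covered by the container */
>       uint16_t  size;             /* compressed size on the page */
>       uint16_t  flags;
>       uint32_t  uncompressed_size;
>       zs_tid    lasttid;          /* last TID covered by the container */
>       /* compressed payload follows */
>   } zs_container_item;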
>
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, an undo record pointer, and the actual
> tuple data, uncompressed.
>
> In uncompressed form, the page can be arbitrarily large. But after
> compression, it must fit into a physical 8k block. If, on insert or
> update of a tuple, the page can no longer be compressed below 8k, the
> page is split. Note that because TIDs are logical rather than physical
> identifiers, we can freely move tuples from one physical page to
> another during page split. A tuple's TID never changes.
>
> The buffer cache caches compressed blocks. Likewise, WAL-logging,
> full-page images etc. work on compressed blocks. Decompression is done
> on the fly in backend-private memory, as and when needed while
> reading. For some encodings, like run-length (RLE) or delta encoding,
> tuples can be constructed directly from the compressed data.
>
> Column store
> ------------
>
> A column store uses the same structure but we have *multiple* B-trees,
> one for each column, all indexed by TID. The B-trees for all columns
> are stored in the same physical file.
>
> A metapage at block 0 has links to the roots of the B-trees. Leaf
> pages look the same, but instead of storing the whole tuple, they store
> just a single attribute. To reconstruct a row with a given TID, the
> scan descends the B-trees of all the columns using that TID and
> fetches all the attributes. Likewise, a sequential scan walks all the
> B-trees in lockstep.
>
> So, in summary, one can imagine Zedstore as a forest of B-trees, one
> for each column, all indexed by TID.
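>
> As a rough sketch of that idea (names and types here are made up for
> illustration, not the actual zedstore functions), reconstructing one
> row costs one B-tree descent per projected column, all using the same
> TID:
>
>   #include <stdbool.h>
>   #include <stdint.h>
>
>   typedef uint64_t  zs_tid;
>   typedef uintptr_t zs_datum;            /* stand-in for PostgreSQL's Datum */
>   typedef struct zs_btree zs_btree;      /* one per-attribute B-tree, opaque */
>
>   /* assume this descends the attribute's B-tree and returns the datum */
>   extern zs_datum zs_btree_lookup(zs_btree *attr_tree, zs_tid tid, bool *isnull);
>
>   /* fetch 'natts' attributes of the row identified by 'tid' */
>   static void
>   zs_fetch_row(zs_btree **attr_trees, int natts, zs_tid tid,
>                zs_datum *values, bool *isnull)
>   {
>       for (int attno = 0; attno < natts; attno++)
>           values[attno] = zs_btree_lookup(attr_trees[attno], tid, &isnull[attno]);
>   }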
>
> This way of laying out the data also easily allows for a hybrid
> row-column store, where some columns are stored together and others
> have a dedicated B-tree. User-facing syntax is needed to allow
> specifying how to group the columns.
>
>
> Main reasons for storing data this way
> --------------------------------------
>
> * Lay out the data/tuples in a mapped fashion instead of keeping the
> logical-to-physical mapping separate from the actual data. That is,
> keep the metadata and data logically in a single stream within the
> file, avoiding the need for separate forks/files for metadata and data.
>
> * Stick to fixed-size physical blocks. Variable-size blocks require
> more logical-to-physical mapping maintenance, plus restrictions on the
> concurrency of writes and reads to files. Hence, compression is
> adopted to fit data into fixed-size blocks, rather than the other way
> around.
>
>
> MVCC
> ----
> MVCC works very similarly to zheap for zedstore. Undo record pointers
> are used to implement MVCC; transaction information is not directly
> stored with the data. In zheap, there's a small, fixed number of
> "transaction slots" on each page, but zedstore keeps an undo pointer
> with each item directly; in normal cases, compression squeezes this
> down to almost nothing.
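>
> Just to sketch the idea (this is not the actual zedstore or zheap
> code, and the helper names are invented), a visibility check chases
> the item's undo pointer instead of reading transaction info from the
> tuple itself:
>
>   #include <stdbool.h>
>   #include <stdint.h>
>
>   #define ZS_INVALID_UNDO_PTR  UINT64_C(0)
>
>   typedef struct
>   {
>       uint64_t undo_ptr;    /* undo record of the inserting/updating xact */
>   } zs_item_hdr;
>
>   /* assumed to be provided by the undo log and snapshot machinery */
>   extern bool zs_undo_record_is_discarded(uint64_t undo_ptr);
>   extern bool zs_undo_record_visible(uint64_t undo_ptr, const void *snapshot);
>
>   static bool
>   zs_item_is_visible(const zs_item_hdr *item, const void *snapshot)
>   {
>       /*
>        * If the undo record is gone (discarded), the creating transaction
>        * committed long ago and the item is visible to every snapshot.
>        */
>       if (item->undo_ptr == ZS_INVALID_UNDO_PTR ||
>           zs_undo_record_is_discarded(item->undo_ptr))
>           return true;
>
>       /* Otherwise the undo record decides visibility for this snapshot. */
>       return zs_undo_record_visible(item->undo_ptr, snapshot);
>   }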
>
>
> Implementation
> ==============
>
> Insert:
> Inserting a new row splits the row into datums. For the first column,
> we decide which block to insert it into, pick a TID for it, and write
> an undo record for it. The rest of the columns are inserted using that
> same TID and point to the same undo position.
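>
> In rough pseudo-C (all names below are illustrative only, not the
> patch's), the flow looks like this:
>
>   #include <stdint.h>
>
>   typedef uint64_t  zs_tid;
>   typedef uintptr_t zs_datum;            /* stand-in for PostgreSQL's Datum */
>   typedef struct zs_btree zs_btree;      /* one B-tree per attribute */
>   typedef struct { zs_btree **attr_tree; int natts; } zs_rel;
>
>   /* assumed helpers: pick a TID/leaf, append an undo record, insert a datum */
>   extern zs_tid   zs_choose_tid(zs_btree *first_attr_tree);
>   extern uint64_t zs_undo_log_insert(zs_rel *rel, zs_tid tid);
>   extern void     zs_btree_insert(zs_btree *tree, zs_tid tid,
>                                   zs_datum value, uint64_t undo_ptr);
>
>   static zs_tid
>   zs_insert_row(zs_rel *rel, zs_datum *values)
>   {
>       /* Decide the target leaf in the first column's tree; that picks the TID. */
>       zs_tid   tid      = zs_choose_tid(rel->attr_tree[0]);
>       /* One undo record per row; every column's item points at it. */
>       uint64_t undo_ptr = zs_undo_log_insert(rel, tid);
>
>       /* All columns are inserted with the same TID and the same undo pointer. */
>       for (int attno = 0; attno < rel->natts; attno++)
>           zs_btree_insert(rel->attr_tree[attno], tid, values[attno], undo_ptr);
>
>       return tid;
>   }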
>
> Compression:
> Items are added to the B-tree in uncompressed form. If a page is full
> and a new item can't be added, compression kicks in: the existing
> uncompressed items (plain items) of the page are passed to the
> compressor, while already-compressed items are added back as is. The
> page is rewritten with the compressed data and the new item added to
> it. If the item still can't be added to the page even after
> compression, the page is split.
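>
> A hypothetical sketch of that control flow (the helper names are made
> up for the example):
>
>   #include <stdbool.h>
>
>   typedef struct zs_leaf_page zs_leaf_page;
>   typedef struct zs_item      zs_item;
>
>   /* assumed helpers over the leaf page */
>   extern bool zs_page_has_room(zs_leaf_page *page, zs_item *item);
>   extern void zs_page_append(zs_leaf_page *page, zs_item *item);
>   extern void zs_page_compress_plain_items(zs_leaf_page *page);
>   extern void zs_leaf_split(zs_leaf_page *page, zs_item *item);
>
>   static void
>   zs_add_item_to_leaf(zs_leaf_page *page, zs_item *newitem)
>   {
>       if (zs_page_has_room(page, newitem))
>       {
>           zs_page_append(page, newitem);      /* common path: add uncompressed */
>           return;
>       }
>
>       /*
>        * Page is full: compress the existing plain items into container
>        * items; items that are already compressed are kept as they are.
>        */
>       zs_page_compress_plain_items(page);
>
>       if (zs_page_has_room(page, newitem))
>           zs_page_append(page, newitem);
>       else
>           zs_leaf_split(page, newitem);       /* still no room: split the page */
>   }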
>
> Toast:
> When an overly large datum is stored, it is divided into chunks, and
> each chunk is stored on a dedicated toast page within the same
> physical file. The toast pages of a datum form a list; each page has a
> next/prev pointer.
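>
> For illustration only (not the actual page header), each toast page
> might carry something like:
>
>   #include <stdint.h>
>
>   typedef uint32_t zs_blkno;                   /* block number in the same file */
>   #define ZS_INVALID_BLKNO ((zs_blkno) 0xFFFFFFFF)
>
>   typedef struct
>   {
>       zs_blkno  prev_blk;                      /* previous chunk, or invalid */
>       zs_blkno  next_blk;                      /* next chunk, or invalid */
>       uint32_t  chunk_size;                    /* bytes of the datum on this page */
>       /* chunk payload follows */
>   } zs_toast_page_header;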
>
> Select:
> A property is added to the table AM to convey whether the AM leverages
> column projection for scans. When scanning a table whose AM has this
> property, the executor parses the plan and uses the target list and
> quals to find the columns required by the query. This list is passed
> down to the AM on beginscan. Zedstore uses this column projection list
> to pull data only from the selected columns. A virtual tuple table
> slot is used to pass back the datums for the subset of columns.
>
> The current table AM API requires enhancement here to pass the column
> projection down to the AM. The patch showcases two different ways of
> doing this; a rough sketch of the call shape follows the list below.
>
> * For sequential scans, a new beginscan_with_column_projection() API
> is added. The executor checks the AM property and, if the AM leverages
> column projection, uses this new API; otherwise it uses the normal
> beginscan() API.
>
> * For index scans, instead of modifying the begin-scan API, a new API
> is added to pass the column projection list after calling begin scan
> to populate the scan descriptor, but before fetching the tuples.
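>
> To make the shape of those two options concrete (all names and
> signatures below are illustrative only, not the actual additions in
> the patch):
>
>   typedef struct scan_desc  scan_desc;     /* stand-in for the scan descriptor */
>   typedef struct relation   relation;      /* stand-in for the relation handle */
>   typedef struct snapshot   snapshot;      /* stand-in for the snapshot */
>   typedef struct column_set column_set;    /* set of attribute numbers to fetch */
>
>   /* option 1: sequential scans get a dedicated beginscan variant that
>    * receives the projection list up front */
>   extern scan_desc *beginscan_with_column_projection(relation *rel,
>                                                      snapshot *snap,
>                                                      column_set *proj_cols);
>
>   /* option 2: index scans begin normally; the projection list is attached
>    * to the scan descriptor afterwards, before any tuple is fetched */
>   extern void scan_set_column_projection(scan_desc *scan,
>                                          column_set *proj_cols);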
>
> Index Support:
> Building an index also leverages the columnar storage and only scans
> the columns required to build the index. Indexes work pretty much as
> they do for heap tables: data is inserted into the table and the
> tuple's TID gets stored in the index. On index scans, the required
> columns' B-trees are scanned for the given TID and the datums are
> passed back using a virtual tuple.
>
> Page Format:
> A ZedStore table contains different kinds of pages, all in the same
> file: the meta-page, per-attribute B-tree internal and leaf pages,
> UNDO log pages, and toast pages. Each page type has its own distinct
> data storage format.
>
> Block 0 is always a metapage. It contains the block numbers of the
> other data structures stored within the file, like the per-attribute
> B-trees, and the UNDO log.
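>
> A hypothetical sketch of that layout (the real structs in the patch
> will differ):
>
>   #include <stdint.h>
>
>   typedef uint32_t zs_blkno;
>
>   typedef enum
>   {
>       ZS_PAGE_META,               /* block 0 */
>       ZS_PAGE_BTREE_INTERNAL,     /* per-attribute B-tree internal page */
>       ZS_PAGE_BTREE_LEAF,         /* per-attribute B-tree leaf page */
>       ZS_PAGE_UNDO,               /* UNDO log page */
>       ZS_PAGE_TOAST               /* chunk of an oversized datum */
>   } zs_page_kind;
>
>   /* block 0: points at every other structure stored in the same file */
>   typedef struct
>   {
>       zs_blkno  undo_head;        /* first UNDO log block */
>       uint32_t  nattributes;      /* number of per-attribute B-trees */
>       zs_blkno  attr_root[];      /* root block of each attribute's B-tree */
>   } zs_metapage;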
>
>
> Enhancements to design:
> =======================
>
> Instead of compressing all the tuples on a page in one batch, we could
> store a small "dictionary", e.g. in page header or meta-page, and use
> it to compress each tuple separately. That could make random reads and
> updates of individual tuples faster.
>
> When adding a column, we just need to create a new B-tree for the
> newly added column and link it to the meta-page. No existing content
> needs to be rewritten.
>
> When a column is dropped, we can scan the B-tree of that column and
> immediately mark all of its pages as free in the FSM. But we don't
> actually have to scan the leaf level: all leaf tuples have a downlink
> in the parent, so we can scan just the internal pages. Unless the
> column is very wide, that's only a small fraction of the data. That
> makes the space immediately reusable for new insertions, but it won't
> return the space to the Operating System. In order to do that, we'd
> still need to defragment, moving pages from the end of the file closer
> to the beginning, and truncate the file.
>
> In this design, we only cache compressed pages in the page cache. If
> we want to cache uncompressed pages instead, or in addition to that,
> we need to invent a whole new kind of buffer cache that can deal
> with variable-size blocks.
>
> If you do a lot of updates, the file can get fragmented, with lots of
> unused space on pages. Losing the correlation between TIDs and
> physical order is also bad, because it will make SeqScans slow, as
> they're not actually doing sequential I/O anymore. We can write a
> defragmenter to fix things up. Half-empty pages can be merged, and
> pages can be moved to restore TID/physical correlation. This format
> doesn't have the same MVCC problems with moving tuples around that the
> Postgres heap does, so it can fairly easily be done online.
>
> Min-max values can be stored per block to easily skip scanning a
> block if the column values don't fall in the requested range.
>
> Notes about current patch
> =========================
>
> Basic (core) functionality is implemented to showcase and play with.
>
> Two compression algorithms are supported: Postgres' pg_lzcompress and
> LZ4. Compiling the server with --with-lz4 enables LZ4 compression for
> zedstore; otherwise pg_lzcompress is the default. LZ4 is definitely
> super fast at compressing and decompressing.
>
> Not all of the table AM APIs are implemented yet; functionality that
> is not implemented yet will ERROR out as not supported. A zedstore
> table can be created using the command:
>
> CREATE TABLE <name> (column listing) USING zedstore;
>
> Bulk load can be performed using COPY. INSERT, SELECT, UPDATE, and
> DELETE work. B-tree indexes can be created, and B-tree and bitmap
> index scans work. The test in src/test/regress/sql/zedstore.sql
> showcases all the functionality working currently. Updates are
> currently implemented as cold updates, meaning they always create new
> items and are never performed in-place.
>
> TIDs currently can't leverage the full 48-bit range, but instead are
> limited to values which are considered valid ItemPointers. Also,
> MaxHeapTuplesPerPage poses restrictions on the values they can
> currently have. Refer to [7] for details.
>
> Extremely basic UNDO logging has been implemented, just from the MVCC
> perspective. MVCC is missing tuple locks right now, and no undo is
> actually performed yet. No WAL logging exists currently either, hence
> it is not crash safe.
>
> Helpful functions are provided to find how many pages of each type
> are present in a zedstore table and to find the compression ratio.
>
> Running the test mentioned in the thread "Column lookup in a row
> performance" [6], a good example query for zedstore, locally on a
> laptop using LZ4 shows:
>
> postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; -- heap
> avg
> ---------------------
> 500000.500000000000
> (1 row)
>
> Time: 4679.026 ms (00:04.679)
>
> postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x; -- zedstore
> avg
> ---------------------
> 500000.500000000000
> (1 row)
>
> Time: 379.710 ms
>
> Important note:
> ---------------
> The planner has not been modified yet to leverage the columnar
> storage. Hence, plans using the "physical tlist" optimization, or
> similar optimizations that are good for a row store, currently miss
> out on leveraging the columnar nature. That is why the subquery with
> OFFSET 0 is needed above: it disables the optimization so that only
> the required column is scanned.
>
>
>
> The current proposal and discussion are focused on the AM-layer work
> first. Hence, we currently and intentionally skip discussing planner
> or executor "feature" enhancements like adding vectorized execution
> and that family of features.
>
> Previous column store discussions and implementations, namely the
> vertical clustered index [2], in-core columnar storage [3] and [4],
> and cstore_fdw [5], were referred to in order to distill the
> objectives and come up with a design and implementation that avoids
> the concerns raised earlier. Learnings from the Greenplum Database
> column store were also leveraged while designing and implementing
> this.
>
> Credit: The design is mostly the brainchild of Heikki, or actually
> his epiphany, to be exact. I acted as an idea bouncing board and
> contributed enhancements to it. We are both having a lot of fun
> writing the code for this.
>
>
> References
> [1] https://github.com/greenplum-db/postgres/tree/zedstore
> [2] https://www.postgresql.org/message-id/flat/CAJrrPGfaC7WC9NK6PTTy6YN-NN%2BhCy8xOLAh2doYhVg5d6HsAA%40mail.gmail.com
> [3] https://www.postgresql.org/message-id/flat/20150611230316.GM133018%40postgresql.org
> [4] https://www.postgresql.org/message-id/flat/20150831225328.GM2912%40alvherre.pgsql
> [5] https://github.com/citusdata/cstore_fdw
> [6] https://www.postgresql.org/message-id/flat/CAOykqKfko-n5YiBJtk-ocVdp%2Bj92Apu5MJBwbGGh4awRY5NCuQ%40mail.gmail.com
> [7] https://www.postgresql.org/message-id/d0fc97bd-7ec8-2388-e4a6-0fda86d71a43%40iki.fi
>
>
Reading about it reminds me of this work -- TAG column storage
( http://www09.sigmod.org/sigmod/record/issues/0703/03.article-graefe.pdf ).
Isn't this storage system inspired by that work, with the TID as the TAG?

It is not referenced here, so it made me wonder.
--
Regards,
Rafia Sabih
