| From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> | 
|---|---|
| To: | Ashwin Agrawal <aagrawal(at)pivotal(dot)io> | 
| Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Zedstore - compressed in-core columnar storage | 
| Date: | 2019-04-10 06:29:03 | 
| Message-ID: | CAA4eK1Kw1h6SE9dO243T8cvak4CPQcvEXpJMPBzkGdVwRP_tBg@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On Tue, Apr 9, 2019 at 5:57 AM Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:
>
> Heikki and I have been hacking recently for few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and link to github branch [1] to
> follow along.
>
> The objective is to gather feedback on design and approach to the
> same. The implementation has core basic pieces working but not close
> to complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> API's. Leveraged the API's for implementing zedstore, and proves API
> to be in very good shape. Had to enhance the same minimally but
> in-general didn't had to touch executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
>   headers and also leveraging compression of similar type data
> * Be first-class citizen in the Postgres architecture (tables data can
>   just independently live in columnar storage)
> * Fully MVCC compliant
> * All Indexes supported
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap)
> * Eliminate need for separate toast tables
> * Faster add / drop column or changing data type of column by avoiding
>   full rewrite of the table.
>
> High-level Design - B-trees for the win!
> ========================================
>
> To start simple, let's ignore column store aspect for a moment and
> consider it as compressed row store. The column store is natural
> extension of this concept, explained in next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID. BTree being a great data structure, fast and versatile. Note this
> is not referring to existing Btree indexes, but instead net new
> separate BTree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having logical TID
> provides flexibility to move the tuples around different pages on page
> splits or page merges can be performed.
>
> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
>  - plain item, holds one tuple or one datum, uncompressed payload
>  - a "container item", holds multiple plain items, compressed payload
>
> +-----------------------------
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +----------------------------
>
> Row store
> ---------
>
> The tuples are stored one after another, sorted by TID. For each
> tuple, we store its 48-bit TID, a undo record pointer, and the actual
> tuple data uncompressed.
>
Storing undo record pointer with each tuple can take quite a lot of
space in cases where you can't compress them.  Have you thought how
will you implement the multi-locker scheme in this design?  In zheap,
we have used undo for the same and it is easy to imagine when you have
separate transaction slots for each transaction.  I am not sure how
will you implement the same here.
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
| From | Date | Subject | |
|---|---|---|---|
| Next Message | tushar | 2019-04-10 06:41:21 | Re: Minimal logical decoding on standbys | 
| Previous Message | Peifeng Qiu | 2019-04-10 06:27:26 | Re: Speed up build on Windows by generating symbol definition in batch |