Re: Zedstore - compressed in-core columnar storage

From: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-04-09 02:26:53
Message-ID: CALfoeisRzUTEyrWzBYjpNzA2XvCkfjY3ppsh+NHTa_C0GJc_FQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 8, 2019 at 6:04 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2019-04-08 17:27:05 -0700, Ashwin Agrawal wrote:
> > Heikki and I have been hacking recently for few weeks to implement
> > in-core columnar storage for PostgreSQL. Here's the design and initial
> > implementation of Zedstore, compressed in-core columnar storage (table
> > access method).
>
> That's very cool.
>
>
> > Motivations / Objectives
> >
> > * Performance improvement for queries selecting subset of columns
> > (reduced IO).
> > * Reduced on-disk footprint compared to heap table. Shorter tuple
> > headers and also leveraging compression of similar type data
> > * Be first-class citizen in the Postgres architecture (tables data can
> > just independently live in columnar storage)
> > * Fully MVCC compliant
> > * All Indexes supported
> > * Hybrid row-column store, where some columns are stored together, and
> > others separately. Provide flexibility of granularity on how to
> > divide the columns. Columns accessed together can be stored
> > together.
> > * Provide better control over bloat (similar to zheap)
> > * Eliminate need for separate toast tables
> > * Faster add / drop column or changing data type of column by avoiding
> > full rewrite of the table.
>
> Is storage going through the bufmgr.c or separately?
>

Yes, below access method its pretty much same as heap. All reads and writes
flow via buffer cache. The implementation sits nicely in between, just
modifying the access method code changing how just how data is stored in
pages, above AM and below AM is basically all behaves similar to heap code.

> > In uncompressed form, the page can be arbitrarily large. But after
> > compression, it must fit into a physical 8k block. If on insert or
> > update of a tuple, the page cannot be compressed below 8k anymore, the
> > page is split. Note that because TIDs are logical rather than physical
> > identifiers, we can freely move tuples from one physical page to
> > another during page split. A tuple's TID never changes.
>
> When does compression happen? After each modifcation of the expanded
>
"page"? Are repeated expansions prevented somehow, e.g. if I
> insert/delete rows into the same page one-by-one?
>

Compression is performed with new data is only if page becomes full, till
then uncompressed data is added to the page. If even after compression
cannot add data to the page then page split is performed. Already
compressed data is not compressed again on next insert. New compressed
block is created for newly added uncompressed items.

The line of thought we have for delete is will not free the space as soon
as delete is performed, but instead delay and reuse the space deleted on
next insertion to the page.

> A metapage at block 0, has links to the roots of the B-trees. Leaf
> > pages look the same, but instead of storing the whole tuple, stores
> > just a single attribute. To reconstruct a row with given TID, scan
> > descends down the B-trees for all the columns using that TID, and
> > fetches all attributes. Likewise, a sequential scan walks all the
> > B-trees in lockstep.
>
> Does the size of the metapage limit the number of column [groups]? Or is
> there some overflow / tree of trees / whatnot happening?
>

In design it doesn't limit the number of columns, as can have chain of
meta-pages to store the required meta-data, page 0 still being start of the
chain.

> > Insert:
> > Inserting a new row, splits the row into datums. Then for first column
> > decide which block to insert the same to, and pick a TID for it, and
> > write undo record for the same. Rest of the columns are inserted using
> > that same TID and point to same undo position.
>
> Is there some buffering? Without that it seems like retail inserts are
> going to be pretty slow?
>

Yes, regular buffer cache.

> > Property is added to Table AM to convey if column projection is
> > leveraged by AM for scans. While scanning tables with AM leveraging
> > this property, executor parses the plan. Leverages the target list and
> > quals to find the required columns for query. This list is passed down
> > to AM on beginscan. Zedstore uses this column projection list to only
> > pull data from selected columns. Virtual tuple table slot is used to
> > pass back the datums for subset of columns.
> >
> > Current table am API requires enhancement here to pass down column
> > projection to AM. The patch showcases two different ways for the same.
> >
> > * For sequential scans added new beginscan_with_column_projection()
> > API. Executor checks AM property and if it leverages column
> > projection uses this new API else normal beginscan() API.
> >
> > * For index scans instead of modifying the begin scan API, added new
> > API to specifically pass column projection list after calling begin
> > scan to populate the scan descriptor but before fetching the tuples.
>
> FWIW, I don't quite think this is the right approach. I've only a vague
> sketch of this in my head, but I think we should want a general API to
> pass that down to *any* scan. Even for heap, not deforming leading
> columns that a uninteresting, but precede relevant columns, would be
> quite a noticable performance win. I don't think the projection list is
> the right approach for that.
>

Sure, would love to hear more on it and can enhance the same as makes more
usable for AMs.

> > Extremely basic UNDO logging has be implemented just for MVCC
> > perspective. MVCC is missing tuple lock right now. Plus, doesn't
> > actually perform any undo yet. No WAL logging exist currently hence
> > its not crash safe either.
>
> Have you looked at the undo APIs developed for zheap, as discussed on
> the list? Seems important that they're suitable for this too.
>

Not in details yet, but yes plan is to leverage the same common framework
and undo log API as zheap. Will look into the details. With the current
zedstore implementation the requirements from the undo are prertty clear.

> > Test mentioned in thread "Column lookup in a row performance" [6],
> > good example query for zedstore locally on laptop using lz4 shows
> >
> > postgres=# SELECT AVG(i199) FROM (select i199 from layout offset 0) x; --
> > heap
> > avg
> > ---------------------
> > 500000.500000000000
> > (1 row)
> >
> > Time: 4679.026 ms (00:04.679)
> >
> > postgres=# SELECT AVG(i199) FROM (select i199 from zlayout offset 0) x;
> --
> > zedstore
> > avg
> > ---------------------
> > 500000.500000000000
> > (1 row)
> >
> > Time: 379.710 ms
>
> Well, I'm not sure I'm actually impressed by that. What does the
> performance look like if you select i0 instead?
>

Just for quick test used 100 instead of 200 columns (with 200 the results
would be more diverged), this is what it reports

postgres=# SELECT AVG(i0) FROM (select i0 from layout offset 0) x; -- heap
avg
------------------------
1.00000000000000000000
(1 row)

Time: 183.865 ms
postgres=# SELECT AVG(i0) FROM (select i0 from zlayout offset 0) x; --
zedstore
avg
------------------------
1.00000000000000000000
(1 row)

Time: 47.624 ms

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-04-09 02:29:18 Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Previous Message Haribabu Kommi 2019-04-09 02:12:23 Re: Status of the table access method work