On columnar storage

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: On columnar storage
Date: 2015-06-11 23:03:16
Message-ID: 20150611230316.GM133018@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

We hope to have a chance to discuss this during the upcoming developer
unconference in Ottawa. Here are some preliminary ideas to shed some
light on what we're trying to do.

I've been trying to figure out a plan to enable native column stores
(CS or "colstore") for Postgres. Motivations:

* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance

There already are some third-party CS implementations for Postgres; some
of these work on top of the FDW interface, others are simply proprietary
forks. Since I don't have access to any of their code, it's not much I
can learn from them. If people with insider knowledge on them can chime
in, perhaps we can work together -- collaboration is very welcome.

We're not interested in perpetuating the idea that a CS needs to go
through the FDW mechanism. Even if there's a lot of simplicity of
implementation, it's almost certain to introduce too many limitations.

Simply switching all our code to use columnar storage rather than
row-based storage is unlikely to go well. We're aiming at letting some
columns of tables be part of a CS, while other parts would continue to
be in the heap. At the same time, we're aiming at opening the way for
different CS implementations instead of trying to provide a single
one-size-fits-all one.

There are several parts to this:

1. the CSM API
2. Cataloguing column stores
3. Query processing: rewriter, optimizer, executor

The Column Store Manager API

Since we want to have pluggable implementations, we need to have a
registry of store implementations. I propose we add a catalog
pg_cstore_impl with OID, name, and a bunch of function references to
"open" a store, "getvalue" from it, "getrows" (to which we pass a qual
and get a bunch of tuple IDs back), "putvalue".

This is in line with our procedural language support.

One critical detail is what will be used to identify a heap row when
talking to a CS implementation. There are two main possibilities:

1. use CTIDs
2. use some logical tuple identifier

Using CTIDs is simpler. One disadvantage is that every UPDATE of a row
needs to let the CS know about the new location of the tuple, so that
the value is known associated with the new tuple location as well as the
old. This needs to happen even if the value of the column itself is not

Using logical tuple identifiers solves this problem: an update does not
change the LTID, so the tuple colstore needn't be involved unless the
attribute(s) in the colstore is being changed. The downside is that the
logical tuple identifier must come from somewhere. We could either use
some user attribute, if there's something appropriate. But it's
probably not good to simply use any primary key that the user has
specified. (Also, having an UPDATE change the primary key would be
troublesome). We could also offer the choice of having an autogenerated
value that's not user-visible; we currently don't have non-user-visible
columns, so this would be additional implementation effort.
Furthermore, we should think about interactions between this and the
IDENTITY stuff we currently have for replication -- my impression is
that IDENTITY does not necessarily represent an useful identifier for
column store purposes.

All in all, it seems prudent to limit the initial implementation to use
CTIDs only, and leave LTIDs for a later stage.

Cataloguing Column Stores

Each table with columns in a separate store will have relhasstore=t.
This hints construction of its relcache entry to obtain rows from
pg_cstore for that table. The new catalog pg_cstore looks like this:

cstname | cststoreid | cstrelid | cstnatts | cstatts

cstname is the store name; unique within each relation.
cststoreid is the OID of the pg_cstore_impl row.
cstorerelid is the OID of the table that this cstore is for.
cstnatts is the number of columns in the store
cstatts is an array of attnums contained in this store.

This design permits having multiple stores for a table, and one or
more columns in a store. We will focus on the case that a table has a
single column store, and a column store has a single column, because we
believe this simplifies several things.

Query Processing


Parsing occurs as currently. During query rewrite, specifically at the
bottom of the per-relation loop in fireRIRrules(), we will modify the
query tree: each relation RTE containing a colstore will be replaced
with a JoinExpr containing the relation as left child and the colstore
as right child (1). The colstore RTE will be of a new RTEKind. For
each such change, all Var nodes that point to attnums stored in the
colstore will modified so that they reference the RTE of the colstore
instead (2).

(1) This seems very similar to what convert_ANY_sublink_to_join() does.

(2) This is very similar to ChangeVarNodes does, except that we modify
only some of the var nodes pointing to the relation, not all.

The addition of the new RTE will use a new function
addRangeTableEntryForColstore(). This seems a bit odd, because that's a
parse-time function, but I see that we're already using -ForSubquery in
convert_ANY_sublink_to_join() so it's probably okay.

Another thing worth mentioning is that we need to have the colstore and
the relation have a junk attribute which use to join them. This will be
the CTID for now, as indicated above.


I still haven't formed any idea of what needs to change in planner code
to handle column stores especially. So far I have the idea that we will
need to distinguish them from other RT kinds by using a new value
RELOPT_COLSTORE for RelOptKind. I think it's not okay to use
RELOPT_BASEREL, because a colstore is not a full-blown relation; and
"otherrel" also seems not appropriate because we do need them to appear
in the join tree, as described above, which current otherrels
(Append/MergeAppend children) do not. Some things such as inner join
removal should work on colstore RTEs just like on plain relations.

Unless the optimization problems are shown to be easily solvable, we
will probably disallow having column stores on inherited relations (esp.
partitioned relations). We can improve the optimizar later to allow for
this, after the basics are in place.


There are a few ideas on a couple of new executor nodes that will need
to operate on colstore RTEs. For example, we can delay accessing
columns in a store until we need to project them; if rows from the main
relation are filtered somewhere upstream, any columns in the store
needn't be accessed for the filtered rows. We have been using the name
LateColumnMaterialization for this.

A completely different consideration is offloading some query quals to a
lone column store scan, which can be combined using BitmapOr or
BitmapAnd with other quals on the relation; the idea is something very
similar to IndexHeapScan in that it takes a qual and produces a list of
TIDs. We're calling this BitmapColumnScan.

The research leading to these results has received funding from the
European Union’s Seventh Framework Programme (FP7/2007-2015) under grant
agreement n° 318633.

Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2015-06-11 23:29:21 Re: DBT-3 with SF=20 got failed
Previous Message Stephen Frost 2015-06-11 21:47:24 Re: CREATE POLICY and RETURNING