Re: Column stores

From: Luke Lonergan <llonergan(at)greenplum(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, Seth Grimes <grimes(at)altaplana(dot)com>
Cc: "pgsql-advocacy(at)postgresql(dot)org" <pgsql-advocacy(at)postgresql(dot)org>
Subject: Re: Column stores
Date: 2008-02-02 17:44:59
Message-ID: C3C9EC1B.5291C%llonergan@greenplum.com
Lists: pgsql-advocacy

Hi Simon, Seth,

On 1/30/08 10:44 AM, "Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:

> Any implementation for Postgres would gain benefit from blending row and
> column approaches within the same database, as an option rather than as
> a must-have.

Agreed, and this is a known "must-have" for a modern column store to be
competitive.

IMO the main benefits claimed by, and genuinely attainable with, column stores
come from:
1) noticeably better compression when columns are stored apart from one
another, since each column can then be encoded in a way that suits its own data
2) a tendency for the implementations to strip abstraction out of the executor
and process more than one row at a time, thereby improving processor
efficiency - see the X100 project for an example, and the sketch just below
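
To make point 2 concrete, here's a toy Python sketch (nothing to do with the
actual X100 code; the table and column names are invented) of the same
aggregate computed row-at-a-time versus column-at-a-time - the column version
runs a tight loop over one array with no per-row overhead:

rows = [(i, i * 2.0, "x") for i in range(1000)]        # row store: one tuple per row
columns = {"a": [r[0] for r in rows],                  # column store: one array per column
           "b": [r[1] for r in rows],
           "c": [r[2] for r in rows]}

def sum_b_row_at_a_time(rows):
    total = 0.0
    for r in rows:          # touches every column of every row
        total += r[1]
    return total

def sum_b_column_at_a_time(columns):
    total = 0.0
    for v in columns["b"]:  # touches only the one column the query needs
        total += v
    return total

assert sum_b_row_at_a_time(rows) == sum_b_column_at_a_time(columns)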

Claimed benefits that are less generally applicable include the ability to
operate directly on compressed data, thereby improving memory bandwidth and
CPU usage.
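
As a rough illustration of what "operate on compressed data" means (a toy
sketch, not any particular product's implementation): with a run-length-encoded
column, an aggregate can consume the (value, count) pairs directly instead of
expanding them back into individual rows:

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def sum_rle(runs):
    # the aggregate never materializes the decompressed column
    return sum(value * count for value, count in runs)

col = [5] * 1000 + [7] * 500
assert sum_rle(rle_encode(col)) == sum(col)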

The major drawbacks include another point that you mentioned, Simon - there is
a lot of overhead in putting isolated columns out to disk and having to
re-assemble them into rows all the time. If you are doing a query that chooses
a few columns out of a hundred in a table, then this easily pays off, but how
many people are running with flattened schemas like that today? If there were
enough to make a market, we'd see a lot more simple approaches to getting the
work done - we'd not have needed to spend 4 years building what we have, for
instance.
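
To show where the re-assembly cost comes from, here's a toy sketch (a Python
list stands in for each column's on-disk file; the table shape is invented):
returning whole rows means stitching every requested column back together, and
the work grows with the number of columns touched:

# a 100-column table, stored as one "file" (list) per column
column_files = {"c%d" % i: list(range(3)) for i in range(100)}

def reassemble(column_files, wanted):
    # fetch each requested column stream and zip them back into row tuples
    cols = [column_files[name] for name in wanted]
    return list(zip(*cols))

few_columns = reassemble(column_files, ["c0", "c1"])            # cheap: 2 streams
whole_rows  = reassemble(column_files, sorted(column_files))    # costly: all 100 streams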

For this reason, some of the column implementations have created an approach
that looks a whole lot like indexing - they're calling it "projection", and I
think it is what it sounds like. They have to "pre-run" the queries through an
analyzer that chooses which columns to project, and they may even duplicate
the data into the new columns - sounds familiar - it's an index.
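
Here's a toy sketch of what a "projection" amounts to, assuming the behaviour
described above (the table and query are invented): a duplicated, pre-sorted
copy of just the columns a known query needs, which is functionally very close
to a covering index:

base_table = {"order_id": [3, 1, 2],
              "customer": ["c", "a", "b"],
              "amount":   [30, 10, 20]}

def build_projection(table, columns, sort_by):
    # copy only the chosen columns, reordered on the sort key
    order = sorted(range(len(table[sort_by])), key=lambda i: table[sort_by][i])
    return {name: [table[name][i] for i in order] for name in columns}

# the "analyzer" decided ahead of time that queries filter on customer
proj = build_projection(base_table, ["customer", "amount"], sort_by="customer")
# proj == {"customer": ["a", "b", "c"], "amount": [10, 20, 30]}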

If the best approach to columns for DW is to create indexes, that's another
old idea, and we've got plenty of that in Postgres. That said - PG needs a
true bitmap index to compete IMO, along with index-only access.
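
For anyone who hasn't run into it, here's a toy sketch of what a bitmap index
buys (invented column names, with Python integers standing in for on-disk
bitmaps): one bitmap per distinct value, so WHERE-clause ANDs and ORs become
bitwise operations, and when only the indexed values are needed the answer
comes back without visiting the heap - the index-only part:

def build_bitmap_index(values):
    index = {}
    for rownum, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << rownum)   # set this row's bit
    return index

status = ["open", "closed", "open", "open", "closed"]
region = ["eu", "eu", "us", "eu", "us"]

s_idx = build_bitmap_index(status)
r_idx = build_bitmap_index(region)

# WHERE status = 'open' AND region = 'eu'  ->  bitwise AND of two bitmaps
match = s_idx["open"] & r_idx["eu"]
matching_rows = [i for i in range(len(status)) if match >> i & 1]   # [0, 3]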

Lastly there's the update problem. When you vertically partition the schema,
you have to coordinate updates across all of the columns, each of which is
compressed differently. It's not an unsolvable problem; it's just another
overhead associated with the approach, but it's a big one.
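
A toy sketch of why that overhead is big (dictionary and delta encoding stand
in for whatever schemes the columns actually use; the data is invented):
changing one logical row means locating and rewriting a position inside every
column's separately compressed representation, rather than one heap tuple:

def dict_encode(values):
    # dictionary encoding: small integer codes plus a value dictionary
    mapping, codes = {}, []
    for v in values:
        codes.append(mapping.setdefault(v, len(mapping)))
    return mapping, codes

def delta_encode(values):
    # delta encoding: store the first value and the differences
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def dict_update(encoded, rownum, new_value):
    mapping, codes = encoded
    codes[rownum] = mapping.setdefault(new_value, len(mapping))
    return mapping, codes

def delta_update(deltas, rownum, new_value):
    # decode, patch, re-encode - a point update ripples through the deltas
    flat = [deltas[0]]
    for d in deltas[1:]:
        flat.append(flat[-1] + d)
    flat[rownum] = new_value
    return delta_encode(flat)

status = dict_encode(["open", "open", "open", "closed"])
ts     = delta_encode([100, 105, 105, 110])

# one logical UPDATE of row 2 has to be applied inside both encodings
status = dict_update(status, 2, "closed")
ts     = delta_update(ts, 2, 107)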

Taking all of the above into consideration, there's an approach I'd like to
see in Postgres that can deliver everything the column people claim without
the drawbacks. +1 to have a technical discussion about it.

- Luke
