Re: [WIP]Vertical Clustered Index (columnar store extension)

From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP]Vertical Clustered Index (columnar store extension)
Date: 2017-01-22 23:26:10
Message-ID: 570c522f-3736-c0ac-83db-0313318d939d@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/16/17 10:09 PM, Haribabu Kommi wrote:
> Yes, that' correct. Currently with this approach, it is not possible to
> ditch the
> heap completely. This approach is useful for the cases, where the user wants
> to store only some columns as part of clustered index.

Ahh, that's unfortunate. Billion row+ tables are becoming rather common,
and that 24GB of overhead starts becoming very painful. It's actually a
lot worse considering there will be at least one index on the table, so
100GB+ of overhead isn't that uncommon.

> Another complication is that one of the big advantages of a CSTORE
> is allowing analysis to be done efficiently on a column-by-column
> (as opposed to row-by-row) basis. Does your patch by chance provide
> that?
>
> Not the base patch that I shared. But the further patches provides the
> data access
> column-by-column basis using the custom plan methods.

Great, that's something else that a column store really needs to be
successful. Something else I suspect is necessary is a faster/better way
to eliminate chunks of rows from scans.

Just as an example, with my simple array-based approach, you can store a
range type along with each array that contains the min and max values
for the array. That means any query that wants values between 50 and 100
can include a clause that filters on range types that overlap with
[50,100]. That can be indexed very efficiently and is fast to run checks
against.

> Generally speaking, I do think the idea of adding support for this
> as an "index" is a really good starting point, since that part of

... as discussed elsewhere in the thread, adding a bunch of hooks is
probably not a good way to do this. :/

> That would be a great way to gain knowledge on what users would want
> to see in a column store, something else I suspect we need. It would
> also be far less code than what you or Alvaro are proposing. When it
> comes to large changes that don't have crystal-clear requirements, I
> think that's really important.
>
> The main use case of this patch is to support mixed load environments,
> where both OLTP and OLAP queries are possible. The advantage of
> proposed patch design is, providing good performance to OLAP queries
> without affecting OLTP.

Yeah, that's a big part of what I was envisioning with my array-based
approach. In simple terms, there would be a regular row-based table, and
an array-based table, with a view that allows seamless querying into
both (re-presenting the array-storage on a per-row basis). There would
be a periodic process that moves entire sets of rows from the row
storage into the array storage.

If you updated or deleted a row that was part of an array, the contents
of the entire array could be moved back into row-based storage. After a
period of time, rows would get moved back into array storage. Or the
array could be modified in place, but you need to be very careful about
bloating the array storage if you do that.

The big missing piece here is getting the planner to intelligently
handle a mixed row/column store. As I mentioned, you can easily add
range type fields to greatly increase performance, but they won't do any
good unless the appropriate filters get added. It's not THAT hard to do
that by hand, but it'd be great if there was a more automated method.
Such a method might also be very useful for transforming expressions
like date_part('quarter', ...) into something that could use existing
indexes.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Corey Huinker 2017-01-22 23:56:59 \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
Previous Message Petr Jelinek 2017-01-22 23:07:27 Re: Logical replication failing when foreign key present