Re: [WIP]Vertical Clustered Index (columnar store extension)

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP]Vertical Clustered Index (columnar store extension)
Date: 2017-01-17 04:09:57
Message-ID: CAJrrPGefW1VomwWUS9BaLoXyHEAu9dLcW5s3JNNp5FxA0R5cFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Jan 8, 2017 at 2:01 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:

> On 12/29/16 9:55 PM, Haribabu Kommi wrote:
>
>> The tuples which don't have multiple copies or frozen data will be moved
>> from WOS to ROS periodically by the background worker process or autovauum
>> process. Every column data is stored separately in it's relation file.
>> There
>> is no transaction information is present in ROS. The data in ROS can be
>> referred with tuple ID.
>>
>
> Would updates be handled via the delete mechanism you described then?
>

Updates are handled similar like delete operations, but there are some extra
index insert operations occurs in this index even when the update is of HOT
type, because of TID-CRID mapping.

> In this approach, the column data is present in both heap and columnar
>> storage.
>>
>
> ISTM one of the biggest reasons to prefer a column store over heap is to
> ditch the 24 byte overhead, so I'm not sure how much of a win this is.
>

Yes, that' correct. Currently with this approach, it is not possible to
ditch the
heap completely. This approach is useful for the cases, where the user wants
to store only some columns as part of clustered index.

Another complication is that one of the big advantages of a CSTORE is
> allowing analysis to be done efficiently on a column-by-column (as opposed
> to row-by-row) basis. Does your patch by chance provide that?
>

Not the base patch that I shared. But the further patches provides the data
access
column-by-column basis using the custom plan methods.

> Generally speaking, I do think the idea of adding support for this as an
> "index" is a really good starting point, since that part of the system is
> pluggable. It might be better to target getting only what needs to be in
> core into core to begin with, allowing the other code to remain an
> extension for now. I think there's a lot of things that will be discovered
> as we start moving into column stores, and it'd be very unfortunate to
> accidentally paint the core code into a corner somewhere.
>

Yes, it is possible to add only the code that is required in the core and
keep the other part
as extension. Without providing the complete clustered index approach, I
doubt whether
the necessary hooks and it's code gets accepted to the core.

> As a side note, it's possible to get a lot of the benefits of a column
> store by using arrays. I've done some experiments with that and got an
> 80-90% space reduction, and most queries saw improved performance as well
> (there were a few cases that weren't better). The biggest advantage to this
> approach is people could start using it today, on any recent version of
> Postgres.

Interesting experiment.

> That would be a great way to gain knowledge on what users would want to
> see in a column store, something else I suspect we need. It would also be
> far less code than what you or Alvaro are proposing. When it comes to large
> changes that don't have crystal-clear requirements, I think that's really
> important.
>

The main use case of this patch is to support mixed load environments,
where both OLTP and OLAP queries are possible. The advantage of
proposed patch design is, providing good performance to OLAP queries
without affecting OLTP.

Regards,
Hari Babu
Fujitsu Australia

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2017-01-17 04:37:56 Re: Cache Hash Index meta page.
Previous Message Fujii Masao 2017-01-17 03:48:19 Re: [PATCH] Rename pg_switch_xlog to pg_switch_wal