Re: On columnar storage (2)

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, tomas(dot)vondra(at)2ndquadrant(dot)com, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Subject: Re: On columnar storage (2)
Date: 2015-12-23 13:49:38
Message-ID: 567AA672.2010602@postgrespro.ru
Lists: pgsql-hackers

Hi Alvaro,

Maybe you know that I have implemented IMCS (in-memory columnar store)
as a PostgreSQL extension.
It was not very successful, mostly because people prefer to use standard
SQL rather than special functions for accessing the columnar storage
(CS). Now I am thinking about a second incarnation of IMCS, based on FDW
and CSP (custom nodes). This is why I am very interested in your patch.
I have looked at the previous version of the patch and have some
questions.
I would be glad if you could clarify them for me:

1. CS API.
I agree with you that the FDW API does not seem sufficient to support
work with a CS efficiently.
At the very least we need batch insert.
But maybe it is better to extend the FDW API rather than create a
special API for the CS?

2. Horizontal<->vertical data mapping. As far as I understand this
patch, the CS model assumes that some table columns are stored in
horizontal format (in the heap) and some in vertical format (in the CS),
with a one-to-one mapping between the horizontal and vertical parts of a
row using the CTID. But having been involved in several projects
requiring OLAP, I found that in most cases a one-to-many mapping is more
convenient. Assume a trading system dealing with stock quotes.
The data looks something like this:

Symbol Day Open Close High Low Volume
AAA 12/22/2015 10.0 12.0 13.0 8.0 100
AAB 12/22/2015 9.0 8.0 10.0 9.0 200
...
AAA 12/23/2015 12.0 11.5 12.5 11.0 50
AAB 12/23/2015 8.0 8.8 8.5 8.0 300

So it can be represented using the following table:

    create table Quote (Symbol char(10), Day date, Open real, High real,
                        Low real, Close real, Volume integer);

Most likely we need to calculate some statistics for a particular
symbol or set of symbols.
For example, a portfolio is a set of symbols, and we need to somehow
analyze the instruments in this portfolio.

There are several thousand symbols, tens of instruments per portfolio,
and tens of thousands of quotes per symbol (in other cases the
timeseries are much larger - millions of elements).
How can we efficiently execute a query like:

    select Symbol, sum(Close*Volume)/sum(Volume) as VWAP from Quote
    where Day between '01/01/2001' and '01/01/2010'
      and Symbol in ('AAA', 'AAB', 'ABB', ...)
    group by Symbol;

If we have an index on Symbol, it will contain a lot of duplicates.
And it is not clear how to efficiently combine an index scan by symbol
name with a time slice.
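
For illustration, the straightforward row-store indexes would look like
this (the index names are just examples):

    -- an index on Symbol alone contains tens of thousands of duplicate
    -- entries per symbol
    create index quote_symbol_idx on Quote (Symbol);
    -- a composite index also covers the time slice, but still stores
    -- one entry per quote row
    create index quote_symbol_day_idx on Quote (Symbol, Day);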

One possible solution is to embed the timeseries into the tuples.
In this case we would have something like this:

    create table Quote (Symbol char(10), Day timeseries(date),
                        Open timeseries(real), High timeseries(real),
                        Low timeseries(real), Close timeseries(real),
                        Volume timeseries(integer));

Here we are using a non-existent type, timeseries. It is similar to an
array, but its content is stored in the columnar storage rather than in
the record's TOAST.
In this case we can efficiently locate records by symbol (there are only
a few thousand entries in the table) and then perform CS operations on
the located timeseries.
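
As a rough sketch of this one-to-many mapping (the timeseries type does
not exist, so plain arrays are used here instead; in the real design the
series would live in the CS rather than in TOAST), the table and a
per-symbol VWAP query could look like this:

    create table QuoteTS (
        Symbol char(10) primary key,
        Day    date[],
        Open   real[],
        High   real[],
        Low    real[],
        Close  real[],
        Volume integer[]
    );

    -- locating the few thousand per-symbol rows is cheap; the heavy
    -- work happens on the embedded series
    select q.Symbol, sum(t.c * t.v) / sum(t.v) as VWAP
    from QuoteTS q,
         lateral unnest(q.Day, q.Close, q.Volume) as t(d, c, v)
    where q.Symbol in ('AAA', 'AAB', 'ABB')
      and t.d between '01/01/2001' and '01/01/2010'
    group by q.Symbol;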

So here we also split the tuple into a horizontal and a vertical part.
In the horizontal part we store just the identifier of the timeseries.
The query plan should combine standard nodes with custom CS nodes.
Mixing horizontal and vertical operations significantly complicates the
optimizer and restricts flexibility: with the proposed representation it
is difficult to efficiently calculate some characteristic for all
symbols in a specified time range. This is why I am not sure that it is
the only possible or the most efficient approach. But in any case there
should be an efficient plan for queries like the one above.
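
For contrast, a query over all symbols in a time range against the same
sketched QuoteTS layout has to unnest every embedded series before it
can filter by date, so the cheap per-symbol lookup no longer helps:

    select avg(t.c) as AvgClose
    from QuoteTS q,
         lateral unnest(q.Day, q.Close) as t(d, c)
    where t.d between '01/01/2001' and '01/01/2010';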

3. Transposing the data and the role of the CS.
Let's look once again at the Quote example above. The data arrives in
time-ascending order, but most queries require grouping it by symbol. So
at some stage we have to "transpose" the data. To append data to a
timeseries efficiently, we need to buffer it somewhere and then append a
whole range of values at once. In the Fujitsu approach two different
representations of the data are used: reader-optimized and
writer-optimized. In the IMCS approach, the CS is just a temporary
projection of normal PostgreSQL tables, so we do not need to worry about
durability - it is enforced by PostgreSQL.

So the question is whether the CS should be the only storage for the
data or just a (possibly transient) copy of a normal table?

Best regards,
Konstantin

On 22.12.2015 17:43, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>> Could we get this rebased past the merge of the parallel execution commits?
>> +1. Alvaro, Tomas, Simon, what are the next plans with those patches?
> Yeah, I've been working intermittently on getting the whole tree rebased
> and squashed, because after the last submission we made a lot of
> progress. I'll repost later. I think it should be marked "returned
> with feedback" for now.
>
