Re: Yet another vectorized engine

From: Hubert Zhang <hzhang(at)pivotal(dot)io>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Gang Xiong <gxiong(at)pivotal(dot)io>, Asim R P <apraveen(at)pivotal(dot)io>, Ning Yu <nyu(at)pivotal(dot)io>
Subject: Re: Yet another vectorized engine
Date: 2019-12-02 01:15:32
Message-ID: CAB0yrenQ7miZf1zu7au-R0A+HdEEdtummpST+VKn5ZckgsWUrw@mail.gmail.com
Lists: pgsql-hackers

Hi Konstantin,
Thanks for your reply.

On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik <
k(dot)knizhnik(at)postgrespro(dot)ru> wrote:

> On 28.11.2019 12:23, Hubert Zhang wrote:
>
> We just want to introduce another POC for vectorized execution engine
> https://github.com/zhangh43/vectorize_engine and want to get some
> feedback on the idea.
>
> But I do not completely understand why you are proposing to implement it
> as extension.
> Yes, custom nodes make it possible to provide vectorized execution without
> affecting the Postgres core.
> But for efficient integration of zedstore and vectorized executor we need
> to extend table-AM (VectorTupleTableSlot and correspondent scan functions).
> Certainly it is easier to contribute vectorized executor as extension, but
> sooner or later I think it should be added to Postgres core.
>
> As far as I understand you already have some prototype implementation
> (otherwise how you got the performance results)?
> If so, are you planning to publish it, or do you think that the executor
> should be developed from scratch?
>

The prototype extension is at https://github.com/zhangh43/vectorize_engine
I agree that a vectorized executor should be added to Postgres core some day.
But it is a huge feature: it requires changes not only to the extended
table-AM you mentioned, but also to every executor node (Agg, Join, Sort,
etc.), and beyond that to the expression evaluation functions and to the
aggregates' transition and combine functions, all of which need vectorized
versions. Hence we implement it as an extension first; once it is popular
in the community and stable, we could merge it into Postgres core whenever
we want.

We do want feedback from the community about CustomScan. CustomScan is just
an abstraction layer. It is typically used to implement user-defined scan
nodes, but other PG extensions (e.g. PG-Strom) have already used it as a
general custom node, e.g. for Agg, Join, etc. Since a vectorized engine
needs to support vectorized processing in every executor node, we follow
that idea and use CustomScan.

> Some my concerns based on VOPS experience:
>

> 1. Vertical (columnar) model is preferable for some kind of queries, but
> there are some classes of queries for which it is less efficient.
> Moreover, data is usually imported into the database in row format.
> Inserting it into a columnar store record-by-record is very inefficient.
> So you need some kind of bulk loader which will be able to buffer input
> data before loading it in columnar store.
> Actually this problem is more related to the data model than to the
> vectorized executor. But what I want to express here is that it may be
> better to have both representations (horizontal and vertical)
> and let the optimizer choose the most efficient one for a particular query.
>
>
Yes, in general the row format is better for OLTP queries and the column
format for OLAP queries.
As for the storage type (or data model), I think the DBA should choose row
or column store for each specific table.
As for the executor, letting the optimizer choose based on cost is a good
idea. That is a long-term goal; for now our extension falls back to the
original row executor for Insert, Update, and IndexScan cases in a rough way.
We want the extension to be enhanced gradually.

> 2. Columnar store and vectorized executor are most efficient for query
> like "select sum(x) from T where ...".
> Unfortunately such simple queries are rarely used in real life. Usually
> analytic queries contain group-by and joins.
> And here vertical model is not always optimal (you have to reconstruct
> rows from columns to perform join or grouping).
> To provide efficient execution of queries you may need to create multiple
> different projections of the same data (sorted by different subset of
> attributes).
> This is why Vertica (one of the most popular columnar store DBMS) is
> supporting projections.
> The same can be done in VOPS: using create_projection function you can
> specify which attributes should be scalar (grouping attributes) and which
> vectorized.
> In this case you can perform grouping and joins using standard Postgres
> executor, while perform vectorized operations for filtering and
> accumulating aggregates.
>

> This is why Q1 is 20 times faster in VOPS and not 2 times as in your
> prototype.
> So I think that columnar store should make it possible to maintain several
> projections of table and optimizer should be able to automatically choose
> one of them for particular query.
> Definitely, synchronization of projections is a challenging problem.
> Fortunately, OLAP usually does not require the most recent data.
>

Projections in Vertica are useful. I tested, and VOPS is really faster; it
would be nice if you could contribute it to PG core. Our extension aims to
change neither the Postgres code nor users' SQL and existing tables.
We will continue to optimize our vectorized implementation. A vectorized
hashagg needs a vectorized hash table implementation, e.g. calculating hash
keys in a batched way and probing the hash table in a batched way. The
original hash table in PG is of course not vectorized.

> 3. I wonder if vectorized executor should support only built-in types and
> predefined operators? Or it should be able to work with any user defined
> types, operators and aggregates?
> Certainly it is much easier to support only built-in scalar types. But it
> contradicts the open and extensible nature of Postgres.
>

Yes, we should support user-defined types. This could be done by
introducing a registration layer which maps each row type to a vector
type, e.g. int4 -> vint4, and likewise for each operator.

> 4. Did you already think about format of storing data in
> VectorTupleTableSlot? Should it be array of Datum? Or we need to represent
> vector in more low level format (for example
> as array of floats for real4 type)?
>

Our perf results show that Datum conversion is not efficient, and we plan
to move from the Datum array to the low-level format array you mentioned.
--
Thanks

Hubert Zhang
