Re: I'd like to discuss scaleout at PGCon

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: I'd like to discuss scaleout at PGCon
Date: 2018-06-06 08:58:51
Message-ID: a147739a-dd03-73e1-0187-1bafa14dec5e@postgrespro.ru
Lists: pgsql-hackers

On 05.06.2018 20:17, MauMau wrote:
> From: Merlin Moncure
>> FWIW, Distributed analytical queries is the right market to be in.
>> This is the field in which I work, and this is where the action is at.
>> I am very, very, sure about this. My view is that many of the
>> existing solutions to this problem (in particular hadoop class
>> solutions) have major architectural downsides that make them
>> inappropriate in use cases that postgres really shines at; direct
>> hookups to low latency applications for example. postgres is
>> fundamentally a more capable 'node' with its multiple man-millennia of
>> engineering behind it. Unlimited vertical scaling (RAC etc) is
>> interesting too, but this is not the way the market is moving as
>> hardware advancements have reduced or eliminated the need for that in
>> many spheres.
> I'm feeling the same. As the Moore's Law ceases to hold, software
> needs to make most of the processor power. Hadoop and Spark are
> written in Java and Scala. According to Google [1] (see Fig. 8), Java
> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
> 2.5x - 3.6x.
>
> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
> someday, when PostgreSQL supports scaleout, in-memory database,
> multi-model capability, and in-database filesystem? That may be a
> pipedream, but why do people have to tolerate the separation of the
> relational-based data warehouse and Hadoop-based data lake?
>
>
> [1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
> Proceedings of Scala Days 2011
>
> Regards
> MauMau
>
>
I cannot completely agree with this. I have done a lot of benchmarking of
PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for
TPC-H queries.
The picture is not so obvious... All these systems scale differently,
and so each shows its best performance on a different hardware
configuration.
Also, the Java JIT has made good progress since 2011. Computation-intensive
code (like matrix multiplication) implemented in Java is about 2 times
slower than optimized C code.
But DBMSes are rarely CPU bound. Even if the whole database fits in memory
(which is not a common scenario for big data applications), a modern CPU
is much faster than RAM access... Java applications are
slower than C/C++ mostly because of garbage collection. This is why
SparkSQL is moving to an off-heap approach, where objects are allocated
outside the Java heap and so do not affect the Java GC. New versions of
SparkSQL with off-heap memory and native code generation show very good
performance. And high scalability has always been one of the major features
of SparkSQL.

So it is naive to expect that Postgres will be 4 times faster than
SparkSQL on analytic queries just because it is written in C and
SparkSQL in Scala.
Postgres has made very good progress in OLAP support in recent
releases: it now supports parallel query execution, JIT, partitioning...
But its scalability is still very limited compared with SparkSQL. I am
not sure about Greenplum with its sophisticated distributed query
optimizer, but most other OLAP solutions for Postgres are not able to
efficiently handle complex queries (with a lot of joins on
non-partitioning keys).
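
To illustrate why joins on non-partitioning keys are hard for a sharded
system, here is a minimal sketch in Python. Everything in it is an
assumption made for illustration: the four-node cluster, the toy
orders(order_id, customer_id) table and the modulo "hash" are hypothetical
and do not describe how any particular Postgres extension distributes data.

```python
from collections import defaultdict

NODES = 4  # hypothetical cluster size


def node_for(key, nodes=NODES):
    # Simplified stand-in for a hash function: integer key modulo node count.
    return key % nodes


def partition(rows, key_index, nodes=NODES):
    """Distribute rows across nodes by the column at key_index."""
    shards = defaultdict(list)
    for row in rows:
        shards[node_for(row[key_index], nodes)].append(row)
    return shards


def rows_to_move(shards, join_key_index, nodes=NODES):
    """Count rows stored on a different node than their join key maps to."""
    return sum(
        1
        for node, rows in shards.items()
        for row in rows
        if node_for(row[join_key_index], nodes) != node
    )


# Toy table orders(order_id, customer_id), distributed by order_id (column 0).
orders = [(i, i // 3) for i in range(1200)]
shards = partition(orders, key_index=0)

# Join on the partitioning key: every row is already on the right node.
local = rows_to_move(shards, join_key_index=0)

# Join on customer_id: rows must be redistributed over the network first.
shuffled = rows_to_move(shards, join_key_index=1)
```

In this toy setup a join on order_id moves nothing, while a join on
customer_id has to move most of the table between nodes before the join
can run.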

I do not want to say that it is impossible to implement a good analytic
platform for OLAP on top of Postgres. But it is a very challenging task.
And IMHO the choice of programming language is not so important. What is
more important is the data storage format. The best systems for data
analytics (Vertica, HyPer, KDB, ...) use a vertical (columnar) data model.
SparkSQL also uses the Parquet file format, which provides efficient
extraction and processing of data.
With the abstract storage API, Postgres is also given a chance to implement
efficient storage for OLAP data processing. But a huge amount of work has
to be done here.
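
The difference between horizontal and vertical storage can be sketched in
a few lines of Python. This is only an illustration under stated
assumptions: the table (order_id, quantity, price) and both function names
are hypothetical, and real column stores add compression and vectorized
execution on top of this layout.

```python
from array import array

# Hypothetical toy table: (order_id, quantity, price).
rows = [(i, i % 7 + 1, float(i % 100)) for i in range(1000)]


def sum_price_row_store(rows):
    # Row-oriented layout: every whole tuple is visited,
    # even though only one attribute is needed.
    total = 0.0
    for _order_id, _quantity, price in rows:
        total += price
    return total


# Column-oriented layout: each attribute lives in its own contiguous array.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "quantity": array("q", (r[1] for r in rows)),
    "price": array("d", (r[2] for r in rows)),
}


def sum_price_column_store(columns):
    # Only the "price" array is read; the other columns are never touched.
    return sum(columns["price"])
```

Both functions return the same total, but the columnar version scans one
array of doubles instead of all three attributes of every row, which is
the property that matters for analytic queries touching few columns.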

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
