Re: I'd like to discuss scaleout at PGCon

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: I'd like to discuss scaleout at PGCon
Date: 2018-06-06 09:11:02
Message-ID: CAFj8pRDRdUn-PsD5A8aYHt4JbsmHqUnGHsx5KE_35GhHmbZh+g@mail.gmail.com
Lists: pgsql-hackers

2018-06-06 10:58 GMT+02:00 Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>:

>
>
> On 05.06.2018 20:17, MauMau wrote:
>
>> From: Merlin Moncure
>>
>>> FWIW, Distributed analytical queries is the right market to be in.
>>> This is the field in which I work, and this is where the action is at.
>>> I am very, very, sure about this. My view is that many of the
>>> existing solutions to this problem (in particular hadoop class
>>> solutions) have major architectural downsides that make them
>>> inappropriate in use cases that postgres really shines at; direct
>>> hookups to low latency applications for example. postgres is
>>> fundamentally a more capable 'node' with its multiple man-millennia of
>>> engineering behind it. Unlimited vertical scaling (RAC etc) is
>>> interesting too, but this is not the way the market is moving as
>>> hardware advancements have reduced or eliminated the need for that in
>>> many spheres.
>>>
>> I'm feeling the same. As Moore's Law ceases to hold, software
>> needs to make the most of the processor power. Hadoop and Spark are
>> written in Java and Scala. According to Google [1] (see Fig. 8), Java
>> is slower than C++ by 3.7x - 12.6x, and Scala is slower than C++ by
>> 2.5x - 3.6x.
>>
>> Won't PostgreSQL be able to cover the workloads of Hadoop and Spark
>> someday, when PostgreSQL supports scaleout, in-memory database,
>> multi-model capability, and in-database filesystem? That may be a
>> pipedream, but why do people have to tolerate the separation of the
>> relational-based data warehouse and Hadoop-based data lake?
>>
>>
>> [1] Robert Hundt. "Loop Recognition in C++/Java/Go/Scala".
>> Proceedings of Scala Days 2011
>>
>> Regards
>> MauMau
>>
>
> I cannot completely agree with this. I have done a lot of benchmarking of
> PostgreSQL, CitusDB, SparkSQL and native C/Scala code generated for TPC-H
> queries.
> The picture is not so obvious... All these systems provide different
> scalability and so show their best performance on different hardware
> configurations.
> Also, the Java JIT has made good progress since 2011. Calculation-intensive
> code (like matrix multiplication) implemented in Java is about 2 times
> slower than optimized C code.
> But DBMSes are rarely CPU bound. Even if the whole database fits in memory
> (which is not such a common scenario for big data applications), the speed
> of a modern CPU is much higher than RAM access speed... Java applications
> are slower than C/C++ mostly because of garbage collection. This is why
> SparkSQL is moving to an off-heap approach, where objects are allocated
> outside the Java heap and so do not affect the Java GC. New versions of
> SparkSQL with off-heap memory and native code generation show very good
> performance. And high scalability has always been one of the major features
> of SparkSQL.
>
> So it is naive to expect that Postgres will be 4 times faster than
> SparkSQL on analytic queries just because it is written in C and SparkSQL
> in Scala.
> Postgres has made very good progress in OLAP support in recent
> releases: it now supports parallel query execution, JIT, partitioning...
> But its scalability is still very limited compared with SparkSQL. I am
> not sure about GreenPlum with its sophisticated distributed query
> optimizer, but most other OLAP solutions for Postgres are not able to
> efficiently handle complex queries (with a lot of joins on
> non-partitioning keys).
>
> I do not want to say that it is not possible to implement a good analytic
> platform for OLAP on top of Postgres. But it is a very challenging task.
> And IMHO the choice of programming language is not so important. What is
> more important is the format in which data is stored. The best systems for
> data analytics (Vertica, HyPer, KDB, ...) use a vertical data model.
> SparkSQL also uses the Parquet file format, which provides efficient
> extraction and processing of data.
> With the abstract storage API, Postgres is also given a chance to implement
> efficient storage for OLAP data processing. But a huge amount of work has
> to be done here.
>

Unfortunately, storage is only one factor. For good performance, columnar
storage needs a different executor. That said, smart columnar storage can
achieve a very good compression ratio, so it can make sense on its own.
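
To illustrate both points, here is a toy sketch in Python. It is not PostgreSQL code: the table, the run-length encoding scheme and all function names are assumptions made up for this example. It shows a sorted column compressing into a few runs, and a column-aware aggregate that works on those runs directly instead of visiting every tuple the way a row-at-a-time executor does.

```python
from itertools import groupby

# Hypothetical toy table of (region, amount) rows, sorted by region.
rows = [("EU", 10), ("EU", 20), ("EU", 30), ("US", 5), ("US", 15)]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    region, amount = zip(*rows)
    return list(region), list(amount)


def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def count_by_region_rows(rows):
    """Row-at-a-time style: touch every tuple to count rows per region."""
    counts = {}
    for region, _amount in rows:
        counts[region] = counts.get(region, 0) + 1
    return counts


def count_by_region_runs(runs):
    """Column-aware style: aggregate on compressed runs, one step per run."""
    counts = {}
    for value, length in runs:
        counts[value] = counts.get(value, 0) + length
    return counts


region_col, amount_col = to_columns(rows)
region_runs = rle_encode(region_col)

print(region_runs)
print(count_by_region_rows(rows))
print(count_by_region_runs(region_runs))
```

Both aggregates return the same counts, but the second one does work proportional to the number of runs, not the number of rows. A row-oriented executor fed from the same storage would first have to decompress the runs back into tuples, which is why the storage format alone does not deliver the speedup.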

Regards

Pavel

> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
