Re: PostgreSQL, OLAP, and Large Clusters

From: Ryan Kelly <rpkelly22(at)gmail(dot)com>
To: Ondrej Ivanič <ondrej(dot)ivanic(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PostgreSQL, OLAP, and Large Clusters
Date: 2012-09-27 18:34:41
Message-ID: 20120927183441.GB25296@llserver.lakeliving.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Sep 27, 2012 at 08:51:31AM +1000, Ondrej Ivanič wrote:
> Hi,
>
> On 26 September 2012 21:50, Ryan Kelly <rpkelly22(at)gmail(dot)com> wrote:
> > The size of our database is growing rather rapidly. We're concerned
> > about how well Postgres will scale for OLAP-style queries over terabytes
> > of data. Googling around doesn't yield great results for vanilla
> > Postgres in this application, but generally links to other software like
> > Greenplum, Netezza, and Aster Data (some of which are based off of
> > Postgres). Too, there are solutions like Stado. But I'm concerned about
> > the amount of effort to use such solutions and what we would have to
> > give up feature-wise.
>
> We were able to have 2TB of data (16 Cores, 96GB RAM, FC connected
> SAN). Performance was good:
> - up to 5sec for simple select with multiple where conditions (2 - 25)
> order by any column (of 400 columns) and return top 10k
> - up to 15sec for executing 5 queries (simultaneously) which return
> top 20 combination of any two columns
> - up to 25sec for executing 56 queries (using materialised aggregate tables)
> - aggregation job ran every 15 minutes and completed under 2 minutes:
> 5mil rows -> aggregation -> 56 tables
5mil overall, or matching your aggregation query? And is that the 2TB
mentioned above? We have more than 100 times that many rows, but less
data.

> - all queries can be executed over date range up to several months
> (monthly partitioned tables, 6 months history)
Yeah we have to be able to query over various date ranges.

> but it was very hard to scale this solution. We have tried:
> - FusionIO cards: 10 to 100 times better performance, but very hard to
> expand storage capacity; Cooling/power issues
> - AsterData: nice SQL-MR feature and analytics (decision trees,
> frequent items, clustering, ...); No libpq support and you have to use
> JDBC or selected ODBC manager
I don't think no libpq support is a deal-breaker, but other missing
features could be problematic.

> - Greenplum (winer): performance comparable to FusionIO (10 to 50
> times); we were able to remove aggregation job (because of columnar
> store model); easy to port from postgres but could be complicated if
> you are heavy pgpsql user
Not using any pl/pgpsql, but a number of other features: arrays and
hstore, which I doubt (hopefully wrongly) that Greenplum supports.

> At this time I would try:
> - Postgres-XC
>From what I understand, more of a write-scaleable-oriented solution. We
mostly will need read scalability. I also don't think it really handles
redundancy.

> - Stado
Looks promising, sounded very promising, but it doesn't seem to be
particularly active or well-documented. It also doesn't support window
functions (which I could probably get by without) or CTEs (which will be
trickier, but doable. I'm also not sure of how easy it is to handle node
failure or adding more nodes, as it appears the number of nodes is
essentially fixed.

> - Cassandra + Solr + Hadoop (for example DataStax Enterprise)
> - Birst (http://www.birst.com/)
>
> PS. For any commercial solution is better to buy their appliance (if
> you can afford it...).
Thanks for the advice. Is it just better supported, or more performant,
or...?

>
> --
> Ondrej Ivanic
> (ondrej(dot)ivanic(at)gmail(dot)com)

-Ryan Kelly

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ryan Kelly 2012-09-27 18:47:27 Re: PostgreSQL, OLAP, and Large Clusters
Previous Message David Johnston 2012-09-27 18:05:56 Re: unc paths, like and backslashes on 8.4