Re: PostgreSQL, OLAP, and Large Clusters

From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: Ryan Kelly <rpkelly22(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PostgreSQL, OLAP, and Large Clusters
Date: 2012-09-26 21:18:16
Message-ID: CAOR=d=2Q8bDbABoV5R6T1YZ_7-H7Lzf3nqQUnfa4m2SEiXhWMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, Sep 26, 2012 at 5:50 AM, Ryan Kelly <rpkelly22(at)gmail(dot)com> wrote:
> Hi:
>
> The size of our database is growing rather rapidly. We're concerned
> about how well Postgres will scale for OLAP-style queries over terabytes
> of data. Googling around doesn't yield great results for vanilla
> Postgres in this application, but generally links to other software like
> Greenplum, Netezza, and Aster Data (some of which are based off of
> Postgres). Too, there are solutions like Stado. But I'm concerned about
> the amount of effort to use such solutions and what we would have to
> give up feature-wise.

If you want fastish OLAP on postgres you need to do several things.

1: Throw very fast disk arrays at it. Lots of spinners in a linux SW
RAID-10 or RAID-0 if your data is easily replaceable work wonders
here.
2: Throw lots of memory at it. Memory is pretty cheap. 256G is not
unusual for OLAP machines
3: Throw fast CPUs at it. Faster CPUs, especially fewer faster cores,
are often helpful.

Applied in that order you can get some pretty impressive results.

A lot of OLAP stuff needs to read hundreds of gigs at a time from the
drive array. An array of 32 15kRPM drives, each reading at ~100MB/s
or better can flood your PCI bus at 3200MB/s for reads. Note that
most RAID controllers aren't as fast for sequential reads on large
drive arrays. Although a battery backed cache can GREATLY improved
parallel write speed, it doesn't really make a big difference for big
sequential stuff and usually gets in the way here.

Memory to cache as much as possible and allow all your queries to do
hash joins etc in memory (crank up work_mem as needed, but be careful
not to use all your memory up.)

Lastly once you've thrown lots of IO and memory at it, a faster CPU
can make a bit of a difference too. But honestly I'd rather have a
dual core 2GHz CPU on top of 32 spinners with 256G than a 3.6GHz 8
core CPU on top of 4 drives and 32G of RAM.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Shiran Kleiderman 2012-09-26 21:36:24 Re: [GENERAL] Memory issues
Previous Message Scott Marlowe 2012-09-26 21:11:19 Re: [GENERAL] Memory issues