Re: Huge Data sets, simple queries

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Mike Biamonte" <mike(at)dbeat(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Huge Data sets, simple queries
Date: 2006-01-28 03:05:04
Message-ID: C0001F60.1B354%llonergan@greenplum.com
Lists: pgsql-performance

Sounds like you are running into the limits of your disk subsystem. You are
scanning all of the data in the transactions table, so you will be limited
by the disk bandwidth you have. Using RAID-10, divide the number of disk
drives by 2 and multiply by their individual bandwidth (around 60 MB/s),
and that's what you can expect in terms of sequential read performance. So,
if you have 8 drives, you should expect to get 4 x 60 MB/s = 240 MB/s of
bandwidth. That means that if the "transactions" table holds 24,000 MB of
data, you will scan it in about 100 seconds.
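
A quick way to see how much data that scan actually has to read is to ask
the database for the table's on-disk size (a minimal sketch; pg_relation_size
and pg_size_pretty are available in 8.1, and the table name is taken from
your query):

  -- On-disk size of the table; divide this by your sequential read rate
  -- (e.g. 240 MB/s) to estimate the best-case scan time.
  SELECT pg_size_pretty(pg_relation_size('transactions')) AS table_size;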

With a workload like this, you are in the realm of business intelligence /
data warehousing, I think. You should check your disk performance; I would
expect you'll find it lacking, partly because you are running RAID-10, but
mostly because I expect you are using a hardware RAID adapter.
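
If you want a rough number for that, one simple check from psql is to time
a full sequential pass over the table (this assumes the table is larger than
RAM, so the read really comes from disk, and that nothing else is running):

  -- \timing makes psql report the elapsed time of each statement
  \timing
  -- this forces a full sequential scan of the table
  SELECT count(*) FROM transactions;

Dividing the table size by the elapsed time gives the effective scan rate
your RAID setup is actually delivering.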

- Luke

On 1/27/06 5:23 PM, "Mike Biamonte" <mike(at)dbeat(dot)com> wrote:

>
>
>
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.
>
> The queries I need to run on my 200 million transactions are relatively
> simple:
>
> select month, count(distinct cardnum), count(*), sum(amount) from
> transactions group by month;
>
> This query took 18 hours on PG 8.1 on a Dual Xeon, RHEL3 (2.4 kernel), with
> RAID-10 (15K drives) and 12 GB RAM. I was expecting it to take about
> 4 hours - based on some experience with a similar dataset on a different
> machine (RH9, PG 7.3, Dual Xeon, 4 GB RAM, RAID-5, 10K drives).
>
> This machine is COMPLETELY devoted to running these relatively simple
> queries one at a
> time. (No multi-user support needed!) I've been tooling with the various
> performance settings:
> effective_cache_size at 5 GB, shared_buffers at 2 GB, and work_mem /
> sort_mem at 1 GB each.
> (Shared buffers puzzles me a bit - my instinct says to set it as high as
> possible, but everything I read says that "too high" can hurt
> performance.)
>
> Any ideas for performance tweaking in this kind of application would be
> greatly appreciated.
> We've got indexes on the fields being grouped, and always vacuum analyze
> after building them.
>
> It's difficult to just "try" various ideas because each attempt takes a
> full day to test. Real
> experience is needed here!
>
> Thanks much,
>
> Mike
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>
