Re: Huge Data sets, simple queries

From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: Mike Biamonte <mike(at)dbeat(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Huge Data sets, simple queries
Date: 2006-01-28 06:50:00
Message-ID: 1138431000.8630.18.camel@noodles
Lists: pgsql-performance

On Fri, 2006-01-27 at 20:23 -0500, Mike Biamonte wrote:
>
> Does anyone have any experience with extremely large data sets?
> I mean hundreds of millions of rows.

Sure, I think more than a few of us do. Just today I built a summary
table from a 25GB primary table with ~430 million rows. This took about
45 minutes.
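
A summary build like that is typically just CREATE TABLE ... AS with a
GROUP BY. In your schema it would look roughly like this (the summary
table name here is made up, the columns are borrowed from your query):

  create table monthly_summary as
    select month,
           count(distinct cardnum) as cards,
           count(*) as txns,
           sum(amount) as total
      from transactions
     group by month;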

> The queries I need to run on my 200 million transactions are relatively
> simple:
>
> select month, count(distinct cardnum), count(*), sum(amount) from
> transactions group by month;
>
> This query took 18 hours on PG 8.1 on a dual Xeon, RHEL3 (2.4 kernel),
> with RAID-10 (15K drives) and 12GB RAM. I was expecting it to take
> about 4 hours, based on some experience with a similar dataset on a
> different machine (RH9, PG 7.3, dual Xeon, 4GB RAM, RAID-5, 10K drives).

Possibly the latter machine has a faster I/O subsystem. How large is
the table on disk?
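
On 8.1 you can get that straight from the database, something along
these lines (table name taken from your query):

  select pg_size_pretty(pg_relation_size('transactions')),
         pg_size_pretty(pg_total_relation_size('transactions'));

The first number is the heap alone; the second includes indexes and
toast.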

> This machine is COMPLETELY devoted to running these relatively simple
> queries one at a time. (No multi-user support needed!) I've been
> tooling with the various performance settings: effective_cache_size
> at 5GB, shared_buffers at 2GB, work_mem and sort_mem at 1GB each.
> (Shared buffers puzzles me a bit - my instinct says to set it as high
> as possible, but everything I read says that "too high" can hurt
> performance.)
>
> Any ideas for performance tweaking in this kind of application would
> be greatly appreciated. We've got indexes on the fields being grouped,
> and always vacuum analyze after building them.

Vacuum probably makes no difference here (on a freshly built table
there's nothing to reclaim), though analyze still matters for the
planner's statistics.

> It's difficult to just "try" various ideas because each attempt takes
> a full day to test. Real experience is needed here!

Can you send us an EXPLAIN of the query? I believe what you're seeing
here is probably:

Aggregate
  +-Sort
      +-Sequential Scan

or perhaps:

Aggregate
  +-Index Scan

I have a feeling that the latter will be much faster. If your table has
been created over time, then it is probably naturally ordered by date,
and therefore also ordered by month. You might expect a Sequential Scan
to be the fastest, but the Sort step will be a killer. On the other
hand, if your table is badly disordered by date, the Index Scan could
also be very slow.
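
You can get a quick read on that physical ordering from the planner's
own statistics, along these lines:

  select attname, correlation
    from pg_stats
   where tablename = 'transactions' and attname = 'month';

A correlation near 1 (or -1) means the heap is laid out nearly in month
order and an index scan will be mostly sequential I/O; near 0 means it
will be mostly random I/O.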

Anyway, send us the query plan and also perhaps a sample of vmstat
during the query.
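
That is, something like this, with vmstat sampling in a second terminal
while the query runs:

  explain select month, count(distinct cardnum), count(*), sum(amount)
    from transactions group by month;

  $ vmstat 5

EXPLAIN ANALYZE would be even better, but given that the query runs for
18 hours, plain EXPLAIN is probably enough to start with.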

For what it's worth, I have:

effective_cache_size | 700000
cpu_tuple_cost       | 0.01
cpu_index_tuple_cost | 0.001
random_page_cost     | 3
shared_buffers       | 50000
temp_buffers         | 1000
work_mem             | 1048576  <= for this query only
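
For clarity: in 8.1 work_mem is in kB, so that's 1GB, and I set it at
the session level rather than in postgresql.conf:

  set work_mem = 1048576;  -- 1GB, this session only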

And here's a few lines from vmstat during the query:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b swpd  free  buff   cache  si so    bi  bo   in  cs us sy id wa
 2  1   76 43476 94916 7655148   0  0 78800   0 1662 788 68 12  0 20
 1  1   76 45060 91196 7658088   0  0 78028   0 1639 712 71 11  0 19
 2  0   76 44668 87624 7662960   0  0 78924  52 1650 736 69 12  0 19
 2  0   76 45300 83672 7667432   0  0 83536  16 1688 768 71 12  0 18
 1  1   76 45744 80652 7670712   0  0 84052   0 1691 796 70 12  0 17

That's about 80MB/sec sequential input, for comparison purposes.
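
(Back-of-envelope: at 80MB/sec, a 25GB table scans in about 25600/80 =
320 seconds, call it five and a half minutes, so most of my 45-minute
summary build presumably went to sorting and aggregation rather than
the scan itself.)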

-jwb
