Question about disk IO and index use and seeking advice

From: "Nikolas Everett" <nik9000(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Question about disk IO and index use and seeking advice
Date: 2008-04-24 13:59:20
Message-ID: d4e11e980804240659i2ab46b7fp3e90467d840cf6d4@mail.gmail.com
Lists: pgsql-performance

I have a question about index use and IO and am seeking advice.

We are running Postgres 8.2. We have two very big tables. Our ~600,000,000
row table is changed very infrequently and is, for historical reasons, on a
12 disk software raid-6 using an LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS controller. Our ~50,000,000 row staging table is on a 12 disk
hardware raid-10 using a Dell PowerEdge Expandable RAID controller 5. All of
the rows in the staging table are changed at least once and then deleted and
recreated in the bigger table. All of the staging table's indexes are on the
raid-10. The postgres data directory itself is on the raid-6. I think all
the disks are 10K SATA drives. The setup is kind of a beast.

So, on to my disk IO and index question. When I issue a query on the big
table like this:
SELECT column, count(*)
FROM bigtable
GROUP BY column
ORDER BY count DESC
When I run dstat to watch my disk IO, I see the software raid-6 consistently
holding over 70 MB/sec. That is fine with me, but I generally don't like to
run queries that table scan 600,000,000 rows. So I do:
SELECT column, count(*)
FROM bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
When I run dstat I see only around 2 MB/sec, and it is not consistent at all.

So my question is: why do I see such low IO load on the index scan version?
If I could tweak some setting to make more aggressive use of IO, would it
actually make the query faster? The field I'm scanning has a .960858
correlation, but I haven't vacuumed since importing the data I'm scanning,
though the correlation should remain very high. When I run a similar pair of
queries against the hardware raid-10 I see similar behavior, except that both
numbers are more than doubled.
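
For reference, the correlation number comes from the planner statistics. A
minimal way to check it, assuming the column is simply named "date" (the
real names differ), is:

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'bigtable'
  AND attname = 'date';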

Here is the explain output for the queries:
SELECT column, count(*)
FROM bigtable
GROUP BY column
ORDER BY count DESC
"Sort (cost=74404440.58..74404444.53 rows=1581 width=10)"
" Sort Key: count(*)"
" -> HashAggregate (cost=74404336.81..74404356.58 rows=1581 width=10)"
" -> Seq Scan on bigtable (cost=0.00..71422407.21 rows=596385921
width=10)"
---------------
SELECT column, count(*)
FROM bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC
"Sort (cost=16948.80..16948.81 rows=1 width=10)"
" Sort Key: count(*)"
" -> HashAggregate (cost=16948.78..16948.79 rows=1 width=10)"
" -> Index Scan using date_idx on bigtable (cost=0.00..16652.77
rows=59201 width=10)"
" Index Cond: (date > '2008-04-21 00:00:00'::timestamp without
time zone)"
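
If it would help, I can also run these with EXPLAIN ANALYZE to get actual
row counts and timings rather than just the estimates above, e.g.:

EXPLAIN ANALYZE
SELECT column, count(*)
FROM bigtable
WHERE date > '4-24-08'
GROUP BY column
ORDER BY count DESC;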

So now for the asking-for-advice part. I have two questions:
What is the fastest way to copy data from the smaller table to the larger
table?
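
Right now it is basically an INSERT ... SELECT out of the staging table
followed by a DELETE. A minimal sketch of what I mean (the "processed" flag
is just a stand-in for however we mark rows as finished):

BEGIN;
INSERT INTO bigtable
SELECT *
FROM stagingtable
WHERE processed;    -- hypothetical flag marking rows ready to move
DELETE FROM stagingtable
WHERE processed;
COMMIT;

I'm wondering whether there is a noticeably faster approach than this.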

We plan to rearrange the setup when we move to Postgres 8.3. We'll probably
move all the storage over to a SAN and slice the larger table into monthly
or weekly tables. Can someone point me to a good page on partitioning? My
gut tells me it should be better, but I'd like to learn more about why; a
rough sketch of what I'm imagining is below. Does anyone have experience
migrating large databases to a SAN? I understand that it'll give us better
failover capability so long as the SAN itself doesn't go out, but are we
going to be sacrificing performance for this? That doesn't even mention the
cost....
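
For the partitioning piece, what I'm imagining is the usual inheritance-based
scheme, roughly like this (the names are made up, and we haven't built any of
it yet):

-- Parent table stays empty; each child holds one month of data.
CREATE TABLE bigtable_2008_04 (
    CHECK (date >= DATE '2008-04-01' AND date < DATE '2008-05-01')
) INHERITS (bigtable);

CREATE INDEX bigtable_2008_04_date_idx ON bigtable_2008_04 (date);

-- With constraint_exclusion = on, a query with a date range should only
-- touch the relevant children.

Is that the right general direction?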

Thanks so much for reading through all this,

--Nik
