Quick Links

Re: Large DB

From:	Ericson Smith <eric(at)did-it(dot)com>
To:	"Mooney, Ryan" <ryan(dot)mooney(at)pnl(dot)gov>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Large DB
Date:	2004-03-31 06:21:47
Message-ID:	406A637B.50407@did-it.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

The issue here might be just organizing the data differently. Or getting
an Opteron server with 16GB RAM :-) Based on the strength of the
developers recommendations in this newsgroup, we recently upgraded to a
dual Opteron 2GHZ with 16GB Ram and 15K hard drives. We set
shared_buffers to 40,000 (just about 320MB Ram), and the difference is
amazing. Just having the OS handle the caching has made all the
difference. You can actually see lots of blocks getting cached by the
OS. (RH Linux Enterprise in our case). In most cases, tables with
millions of records would get entirely cached in RAM, and there would be
no disk access whatsoever for selects in a few minutes.

Based on the queries you run, is it possible to split up the schema into
different tables? Are the differences between timestamps in the sample
query usually small? We had a similar problem, although with a slightly
smaller data set -- but one that was going to keep growing.

Our questions were: how could we scale? What about vacuuming our tables,
running analyze in a decent time? backing up? and so on. We found that
most of the queries we wanted were in the domain of a day. So we
actually split up that giant table and made one for each day. We could
have done it one for each week as well, but the daily tables worked well
for us. Sure, its a bit more work getting data over a long time period,
but those common queries were a cinch.

We've also seen that in cases were we have to dump in thousands of
records every few minutes that select queries respond remarkedly faster
when frequent (one or two every hour) ANALYZE's are done even on those
daily tables which contain just a few mil records each.

Tweaking the hardware IMHO would probably take you just a little
further, but you gotta think about what your response times will be in
another month based on your growth now. Can your schema stand it then?

- Ericson Smith

Mooney, Ryan wrote:

>Hello,
>
>I have a single table that just went over 234GB in size with about 290M+
>rows. I think that I'm starting to approach some limits since things
>have gotten quite a bit slower over the last couple days. The table is
>really simple and I'm mostly doing simple data mining queries like the
>query included below. These have gone from taking a under a few minutes
>to taking tens of minutes over the last week or so (a query like the
>below would generally return 1-6 million rows from the base table). The
>queries use the indexes fairly well, although I suspect that the order
>of host/starttime is suboptimal (fewer hosts than starttime, and the
>table is naturally in starttime order). I'm going to try adding an
>index on just starttime (and later just host) and see if I can tune the
>queries on that more. I never delete rows from the table, only do
>inserts (up to around 11,000/minute mostly in one big burst every
>minute, this is anticipated to go up some over time). There are about
>32 processes doing the inserts (on the same machine - yeah I know it'd
>be happier if they moved); I think it might help if there was only one,
>but for architectural reasons that won't happen for a while.
>
>This is on a dual 3Ghz xenon with 4G Ram and an IDE-SCSI raid array
>(ACNC) I'm running RH Fedora with kernel 2.4.22-1.2115.nptlsmp (we'd
>tried FBSD 4/5 early on, but the insert speeds were actually better with
>RH9 by a ~10% or so - this was pre fbsd 5.2, but it's a bit late to
>migrate easily now).
>
>I'm trying to figure out ways to squeak another ounce or two of
>performance out of this machine, I've included the things I've tuned so
>far below.
>
>The query processes are mostly stuck in D state so I expect that I'm
>hitting some hw limitations, but I'm only doing sub 15MB from the disk
>array (from iostat) and I know it can do in the 40-60MB range when we
>tested the raw speed, and only 400 or so tps which is also well under
>the arrays limits so I suspect that its thrashing a bit, this is also
>indicated by the contrast between rrqm/s (read requests merged per
>second) which is pushing 2000 and the actual r/s (read requests that
>were issued to the device) at around 400 or so (same as tps). I suspect
>that a lot of the time is spent reading indexes, so a "better" indexing
>scheme may be my best bet.
>
>Estimating the table size
>-------------------------
>
>stats=> select relfilenode,relname from pg_class where
>relfilenode=37057796;
> relfilenode | relname
>-------------+---------
> 37057796 | tp3
>
>du -sc 37057796* | grep total
>234002372 total
>
>However the two indexes are also - large (which may be part of the
>problem, which is why I'm trying just starttime for an index; They are
>currently in the 140-150G range).
>
>The query optimizer thinks I have ~ 290M rows (I'm not actually doing a
>real select count since the last time I did that was around 10M rows or
>so and it took a long while, I don't want to wait days :).
>------------------------------
>stats=> explain select count(*) from tp3;
> QUERY PLAN
>-----------------------------------------------------------------------
> Aggregate (cost=7632998.20..7632998.20 rows=1 width=0)
> -> Seq Scan on tp3 (cost=0.00..6906493.16 rows=290602016 width=0)
>(2 rows)
>
>Table def
>----------------------------
>stats=> \d tp3
> Table "public.tp3"
> Column | Type | Modifiers
>-------------+-----------------------------+-----------
> host | character(4) |
> point | character varying(64) |
> type | character(1) |
> cooked | character(1) |
> starttime | timestamp without time zone |
> intervallen | interval |
> arrivetime | timestamp without time zone |
> pvalue | numeric |
>Indexes:
> "tp3_host_starttime" btree (host, starttime, cooked)
> "tp3_point_starttime" btree (point, starttime, cooked)
>
>
>Sample data mining query:
>----------------------------
>select point, avg(pvalue) as avg from tp3 where host in ('node', 'node',
>....) and starttime between 'timestamp' and 'timestamp' group by point
>
>Tuning done so far:
>----------------------------
>$ cat /etc/sysctl.conf
>
>kernel.shmall=805306368
>kernel.shmmax=805306368
>
>$ egrep -v "^#|^$" postgresql.conf
>
>shared_buffers = 60800
>sort_mem = 1286720 # min 64, size in KB
>vacuum_mem = 102400 # min 1024, size in KB
>fsync=false # Play fast and loose - whee
>max_files_per_process = 1000
>wal_buffers = 16
>checkpoint_segments = 20
>checkpoint_timeout = 100
>effective_cache_size = 160000
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly
>
>

In response to

Large DB at 2004-03-31 01:48:14 from Mooney, Ryan

Browse pgsql-general by date

	From	Date	Subject
Next Message	Teodor Sigaev	2004-03-31 07:18:34	Re: Wich hardware suits best for large full-text indexed
Previous Message	Ericson Smith	2004-03-31 05:59:04	Re: Wich hardware suits best for large full-text indexed