Re: Large Database Performance suggestions

From: Joshua Marsh <icub3d(at)gmail(dot)com>
To: pg(at)fastcrypt(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Large Database Performance suggestions
Date: 2004-10-26 15:24:08
Message-ID: 38242de904102608246d060f7d@mail.gmail.com
Lists: pgsql-performance

Thanks for all of your help so far. Here is some of the information
you guys were asking for:

Test System:
2x AMD Opteron 244 (1.8GHz)
8GB RAM
7x 72GB SCSI HDD (RAID 5)

postgresql.conf information:
#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

# - Memory -

shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024 # min 64, size in KB
#vacuum_mem = 8192 # min 1024, size in KB
sort_mem = 4096000
vacuum_mem = 1024000

# - Free Space Map -

#max_fsm_pages = 20000 # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000 # min 100, ~50 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000 # min 25
#preload_libraries = ''

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = true # turns forced synchronization on or off
#wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync
#wal_buffers = 8 # min 4, 8KB each

# - Checkpoints -

#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300 # range 30-3600, in seconds
#checkpoint_warning = 30 # 0 is off, in seconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000

Everything else is at its default. I actually think the WAL
options are at their defaults as well, but I don't recall exactly :)
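
If it helps frame suggestions: I realize shared_buffers is still the stock
1000 and that sort_mem is a per-sort-operation value in KB, so those numbers
may be part of the problem. A hypothetical starting point for an 8GB box,
along the lines of the usual tuning guides (just a sketch, nothing we have
benchmarked), might be:

shared_buffers = 50000          # ~400MB; unit is 8KB buffers
sort_mem = 16384                # ~16MB per sort/hash step; unit is KB
vacuum_mem = 65536              # ~64MB for VACUUM; unit is KB
effective_cache_size = 750000   # ~6GB; planner hint, unit is 8KB pages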

As for the queries and tables: the data we store is confidential, but
it is essentially an account number plus a bunch of boolean fields
that specify whether a person meets certain criteria. So a query may
look something like:

SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH';

which is explained as something like this:
                                QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on view_of_data  (cost=0.00..25304.26 rows=22054 width=11)
   Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
(2 rows)
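
For a concrete picture of the table shape, it is roughly the following
(the column names come from the query above; the types and widths are
just illustrative guesses):

CREATE TABLE view_of_data (
    acctno            varchar(11),   -- account number; width guessed from the plan
    has_name          boolean,
    is_active_member  boolean,
    state             varchar(2)
    -- ...plus a number of other boolean criteria columns
);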

Occasionally, because we store data from several sources, we get
requests for data from a specific source. We simply intersect the
view_of_data table with a sources table that lists which acctnos belong
to which source. That query looks something like this:

SELECT acctno FROM view_of_data WHERE has_name AND is_active_member
AND state = 'OH' INTERSECT SELECT acctno FROM sources_data WHERE
source = 175;

which is explained as follows:
                                         QUERY PLAN
-------------------------------------------------------------------------------------------
 SetOp Intersect  (cost=882226.14..885698.20 rows=69441 width=11)
   ->  Sort  (cost=882226.14..883962.17 rows=694411 width=11)
         Sort Key: acctno
         ->  Append  (cost=0.00..814849.42 rows=694411 width=11)
               ->  Subquery Scan "*SELECT* 1"  (cost=0.00..25524.80 rows=22054 width=11)
                     ->  Seq Scan on view_of_data  (cost=0.00..25304.26 rows=22054 width=11)
                           Filter: (has_name AND is_active_member AND ((state)::text = 'OH'::text))
               ->  Subquery Scan "*SELECT* 2"  (cost=0.00..789324.62 rows=672357 width=11)
                     ->  Seq Scan on sources_data  (cost=0.00..782601.05 rows=672357 width=11)
                           Filter: (source = 23)
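
In case it matters for suggestions, the same request could presumably also
be phrased as a semi-join instead of an INTERSECT; something like this
untested sketch:

SELECT DISTINCT v.acctno
  FROM view_of_data v
 WHERE v.has_name
   AND v.is_active_member
   AND v.state = 'OH'
   AND EXISTS (SELECT 1
                 FROM sources_data s
                WHERE s.acctno = v.acctno
                  AND s.source = 175);

We haven't measured whether the planner does any better with that form on
the larger data sets.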

Again, we see our biggest bottlenecks when we get over about 50
million records. The time to execute grows exponentially from that
point.

Thanks again for all of your help!

-Josh

On Fri, 22 Oct 2004 07:38:49 -0400, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
> Josh,
>
> Your hardware setup would be useful too. It's surprising how slow some
> big name servers really are.
> If you are seriously considering memory sizes over 4G you may want to
> look at an opteron.
>
> Dave
>
>
>
> Joshua Marsh wrote:
>
> >Hello everyone,
> >
> >I am currently working on a data project that uses PostgreSQL
> >extensively to store, manage and maintain the data. We haven't had
> >any problems regarding database size until recently. The three major
> >tables we use never get bigger than 10 million records. With this
> >size, we can do things like storing the indexes or even the tables in
> >memory to allow faster access.
> >
> >Recently, we have found customers who are wanting to use our service
> >with data files between 100 million and 300 million records. At that
> >size, each of the three major tables will hold between 150 million and
> >700 million records. At this size, I can't expect it to run queries
> >in 10-15 seconds (what we can do with 10 million records), but would
> >prefer to keep them all under a minute.
> >
> >We did some original testing with a server with 8GB of RAM and
> >found we can do operations on data files up to 50 million records
> >fairly well, but performance drops dramatically after that. Does
> >anyone have any suggestions on a good way to improve performance for
> >these extra large tables? Things that have come to mind are replication
> >and Beowulf clusters, but from what I have recently studied, these
> >don't do so well with singular processes. We will have parallel
> >processes running, but it's more important that each individual process
> >be fast than that several processes run in parallel at once.
> >
> >Any help would be greatly appreciated!
> >
> >Thanks,
> >
> >Joshua Marsh
> >
> >P.S. Off-topic, I have a few invitations to gmail. If anyone would
> >like one, let me know.
> >
> >---------------------------(end of broadcast)---------------------------
> >TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org
> >
> >
> >
> >
>
> --
> Dave Cramer
> http://www.postgresintl.com
> 519 939 0336
> ICQ#14675561
>
>
