
Re: General performance questions about postgres on Apple

From: Sean Shanny <shannyconsulting(at)earthlink(dot)net>
To: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org
Subject: Re: General performance questions about postgres on Apple
Date: 2004-02-23 16:50:50
Message-ID: 403A2F6A.7000300@earthlink.net
Lists: pgsql-performance
Scott,

We did try clustering on the date_key for the fact table below for a 
month's worth of data, as most of our requests for data are date-range 
based, i.e. get me info for the time period between 2004-02-01 and 
2004-02-07.  This normally results in a plan that does an index scan on 
the date_key, which in theory should be fast.  However, we have found 
that it is almost always faster to run a sequential scan on the data 
set, due to its size and probably, as Tom pointed out, the high seek 
time we seem to be experiencing with the Mac hardware, which kills us 
when using the index to pop all over the disk.  We saw no improvement 
after clustering on the date_key.
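
For what it's worth, the clustering attempt looked roughly like the 
following (the index and table names match the schema below; the 
date_key literals are only illustrative, since date_key is a surrogate 
key into our date dimension):

CLUSTER idx_pageviews_date ON f_pageviews;
ANALYZE f_pageviews;

-- a typical date-range query against the clustered table
SELECT count(*)
  FROM f_pageviews
 WHERE date_key BETWEEN 20040201 AND 20040207;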

I am certainly open to any suggestions on how to deal with speed issues 
on these sorts of large tables; they aren't going to go away for us. :-(

We are working on making the table below smaller in record size so we 
can fit more records in a page.  For example, we are removing the 
subscriber_key, which is 32 characters wide, and replacing it with an 
int (user_id) that is an FK to a dimension table.
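
Roughly, the change looks like the following sketch (the d_user 
dimension table here is hypothetical, the real migration will run in 
batches, and the new userid_key column already appears in the schema 
below):

-- hypothetical user dimension holding the wide key once
CREATE TABLE d_user (
    user_id        serial PRIMARY KEY,
    subscriber_key text NOT NULL UNIQUE
);

-- populate the dimension from the existing 32-character keys
INSERT INTO d_user (subscriber_key)
SELECT DISTINCT subscriber_key FROM f_pageviews;

-- backfill the narrow FK column, then drop the wide text column
UPDATE f_pageviews
   SET userid_key = d_user.user_id
  FROM d_user
 WHERE d_user.subscriber_key = f_pageviews.subscriber_key;

ALTER TABLE f_pageviews DROP COLUMN subscriber_key;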

I welcome any advice from folks who have used postgres to build data 
warehouses.

Thanks.

--sean


 Table "public.f_pageviews"
         Column         |  Type   |                          Modifiers
------------------------+---------+-------------------------------------------------------------
 id                     | integer | not null default nextval('public.f_pageviews_id_seq'::text)
 date_key               | integer | not null
 time_key               | integer | not null
 content_key            | integer | not null
 location_key           | integer | not null
 session_key            | integer | not null
 subscriber_key         | text    | not null
 persistent_cookie_key  | integer | not null
 ip_key                 | integer | not null
 referral_key           | integer | not null
 servlet_key            | integer | not null
 tracking_key           | integer | not null
 provider_key           | text    | not null
 marketing_campaign_key | integer | not null
 orig_airport           | text    | not null
 dest_airport           | text    | not null
 commerce_page          | boolean | not null default false
 job_control_number     | integer | not null
 sequenceid             | integer | not null default 0
 url_key                | integer | not null
 useragent_key          | integer | not null
 web_server_name        | text    | not null default 'Not Available'::text
 cpc                    | integer | not null default 0
 referring_servlet_key  | integer | not null default 1
 first_page_key         | integer | not null default 1
 newsletterid_key       | text    | not null default 'Not Available'::text
 userid_key             | integer |
Indexes:
    "f_pageviews_pkey" primary key, btree (id)
    "idx_pageviews_date" btree (date_key)
    "idx_pageviews_session" btree (session_key)


scott.marlowe wrote:

>On Sun, 22 Feb 2004, Sean Shanny wrote:
>
>  
>
>>Tom,
>>
>>We have the following setting for random page cost:
>>
>>random_page_cost = 1            # units are one sequential page fetch cost
>>
>>Any suggestions on what to bump it up to?
>>
>>We are waiting to hear back from Apple on the speed issues, so far we 
>>are not impressed with the hardware in helping in the IO department.  
>>Our DB is about 263GB with indexes now, so there is no way it is going 
>>to fit into memory. :-(  I have taken the step of breaking out the data 
>>into month-based groups just to keep the table sizes down.  Our current 
>>months table has around 72 million rows in it as of today.  The joys of 
>>building a data warehouse and trying to make it as fast as possible.
>>    
>>
>
>You may be able to achieve similar benefits with a clustered index.
>
>see cluster:
>
>\h cluster
>Command:     CLUSTER
>Description: cluster a table according to an index
>Syntax:
>CLUSTER indexname ON tablename
>CLUSTER tablename
>CLUSTER
>
>I've found this can greatly increase speed, but on 263 gigs of data, I'd 
>run it when you had a couple days free.  You might wanna test it on a 
>smaller test set you can afford to chew up some I/O and CPU time on over a 
>weekend.
>
>
>  
>
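
On the random_page_cost question quoted above: the setting can be tried 
per session before changing postgresql.conf.  The value 4 below is 
simply the shipped default, not a recommendation:

-- try a higher random_page_cost for this session only; we currently
-- run with 1 in postgresql.conf and the shipped default is 4
SET random_page_cost = 4;
SHOW random_page_cost;

-- then compare the plan and timing of a typical date-range query
EXPLAIN ANALYZE
SELECT count(*)
  FROM f_pageviews
 WHERE date_key BETWEEN 20040201 AND 20040207;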
