Re: Looking for ideas on how to speed up warehouse loading

From: Nicholas Shanny <nshanny(at)tripadvisor(dot)com>
To: Sean Shanny <shannyconsulting(at)earthlink(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Looking for ideas on how to speed up warehouse loading
Date: 2004-04-23 00:54:15
Message-ID: BE0C810C-94C0-11D8-B96E-000A95770900@tripadvisor.com
Lists: pgsql-performance

One other thing: we are running with a block size of 32K.

Nick Shanny
(Brother of above person)

On Apr 22, 2004, at 7:30 PM, Sean Shanny wrote:

> I should have included this as well:
>
> show all;
> name | setting
> --------------------------------+----------------
> add_missing_from | on
> australian_timezones | off
> authentication_timeout | 60
> check_function_bodies | on
> checkpoint_segments | 64
> checkpoint_timeout | 30
> checkpoint_warning | 30
> client_encoding | UNICODE
> client_min_messages | notice
> commit_delay | 0
> commit_siblings | 5
> cpu_index_tuple_cost | 0.001
> cpu_operator_cost | 0.0025
> cpu_tuple_cost | 0.01
> DateStyle | ISO, MDY
> db_user_namespace | off
> deadlock_timeout | 1000
> debug_pretty_print | off
> debug_print_parse | off
> debug_print_plan | off
> debug_print_rewritten | off
> default_statistics_target | 1000
> default_transaction_isolation | read committed
> default_transaction_read_only | off
> dynamic_library_path | $libdir
> effective_cache_size | 400000
> enable_hashagg | on
> enable_hashjoin | on
> enable_indexscan | on
> enable_mergejoin | on
> enable_nestloop | on
> enable_seqscan | on
> enable_sort | on
> enable_tidscan | on
> explain_pretty_print | on
> extra_float_digits | 0
> from_collapse_limit | 8
> fsync | on
> geqo | on
> geqo_effort | 1
> geqo_generations | 0
> geqo_pool_size | 0
> geqo_selection_bias | 2
> geqo_threshold | 11
> join_collapse_limit | 8
> krb_server_keyfile | unset
> lc_collate | C
> lc_ctype | C
> lc_messages | C
> lc_monetary | C
> lc_numeric | C
> lc_time | C
> log_connections | off
> log_duration | off
> log_error_verbosity | default
> log_executor_stats | off
> log_hostname | off
> log_min_duration_statement | -1
> log_min_error_statement | panic
> log_min_messages | notice
> log_parser_stats | off
> log_pid | off
> log_planner_stats | off
> log_source_port | off
> log_statement | off
> log_statement_stats | off
> log_timestamp | on
> max_connections | 100
> max_expr_depth | 10000
> max_files_per_process | 1000
> max_fsm_pages | 20000
> max_fsm_relations | 1000
> max_locks_per_transaction | 64
> password_encryption | on
> port | 5432
> pre_auth_delay | 0
> preload_libraries | unset
> random_page_cost | 4
> regex_flavor | advanced
> rendezvous_name | unset
> search_path | $user,public
> server_encoding | UNICODE
> server_version | 7.4.1
> shared_buffers | 4000
> silent_mode | off
> sort_mem | 64000
> sql_inheritance | on
> ssl | off
> statement_timeout | 0
> stats_block_level | on
> stats_command_string | on
> stats_reset_on_server_start | off
> stats_row_level | on
> stats_start_collector | on
> superuser_reserved_connections | 2
> syslog | 0
> syslog_facility | LOCAL0
> syslog_ident | postgres
> tcpip_socket | on
> TimeZone | unknown
> trace_notify | off
> transaction_isolation | read committed
> transaction_read_only | off
> transform_null_equals | off
> unix_socket_directory | unset
> unix_socket_group | unset
> unix_socket_permissions | 511
> vacuum_mem | 64000
> virtual_host | unset
> wal_buffers | 1024
> wal_debug | 0
> wal_sync_method | open_sync
> zero_damaged_pages | off
>
>
> Sean Shanny wrote:
>
>> To all,
>>
>> Essentials: Running 7.4.1 on OS X on a loaded G5 with dual procs, 8GB
>> memory, direct-attached via Fibre Channel to a fully optioned 3.5TB
>> XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
>>
>> Background: We are loading what are essentially XML-based access
>> logs from 20+ web servers daily, about 6GB of raw data. We have
>> a classic star schema. All the ETL tools are custom Java code or
>> standard *nix tools like sort, uniq, etc.
>>
>> The problem: We have about 46 million rows in a table with the
>> following schema:
>>
>> Table "public.d_referral"
>> Column | Type | Modifiers
>> --------------------+---------+-----------
>> id | integer | not null
>> referral_raw_url | text | not null
>> job_control_number | integer | not null
>> Indexes:
>> "d_referral_pkey" primary key, btree (id)
>> "idx_referral_url" btree (referral_raw_url)
>>
>> This is one of our dimension tables. Part of the daily ETL process
>> is to match all the new referral URLs against existing data in the
>> d_referral table. Some of the values in referral_raw_url can be 5000
>> characters long :-(. The average length is 109.57 characters.
>>
>> I sort and uniq all the incoming referrals and load them into a temp
>> table.
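>>
>> Roughly, that step is the sketch below; the file and database names
>> are placeholders, and the dedup has to happen before the load since
>> url is the primary key of referral_temp:
>>
>> # dedup first so the load does not violate referral_temp_pkey
>> LC_ALL=C sort -u referrals.raw > referrals.uniq
>> # file and database names here are placeholders
>> psql -d warehouse -c "\copy referral_temp from referrals.uniq"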
>>
>> Table "public.referral_temp"
>> Column | Type | Modifiers
>> --------+------+-----------
>> url | text | not null
>> Indexes:
>> "referral_temp_pkey" primary key, btree (url)
>>
>> I then do a left join:
>>
>> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral
>> t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
>>
>> This is the output from an explain analyze. (Note that I set
>> enable_indexscan = false prior to issuing this, because it takes
>> forever using index scans.)
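>>
>> The exact command, run beforehand in the same session (enable_indexscan
>> is the setting name, as in the config dump above):
>>
>> SET enable_indexscan = false;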
>>
>> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
>> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>>
>> QUERY PLAN
>> ------------------------------------------------------------------------
>> Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual time=1297320.823..1297739.813 rows=476176 loops=1)
>>   Sort Key: t1.id
>>   ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856 width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
>>         Hash Cond: ("outer".url = "inner".referral_raw_url)
>>         ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56 rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
>>         ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124) (actual time=1146440.710..1146440.710 rows=0 loops=1)
>>               ->  Seq Scan on d_referral t1  (cost=0.00..729338.16 rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715 loops=1)
>> Total runtime: 1298153.193 ms
>> (8 rows)
>>
>>
>>
>> What I would like to know is whether there are better ways to do the
>> join. I need all the rows back from the referral_temp table, as they
>> are used for assigning FKs for the fact table later in processing.
>> When I iterate over the returned values, for those with t1.id = null
>> I assign a new FK and push the pair into the d_referral table as a
>> new entry, as well as into a text file for later use. The matching
>> records are likewise written to a text file for later use.
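>>
>> (In SQL terms, that unmatched-row step amounts to something like the
>> sketch below; the sequence name and the job_control_number value are
>> placeholders, not our actual names:)
>>
>> INSERT INTO d_referral (id, referral_raw_url, job_control_number)
>> SELECT nextval('d_referral_id_seq'),  -- hypothetical sequence name
>>        t2.url,
>>        123                            -- placeholder job control number
>>   FROM referral_temp t2
>>   LEFT JOIN d_referral t1 ON t2.url = t1.referral_raw_url
>>  WHERE t1.id IS NULL;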
>>
>> If we cannot improve the join performance, my question becomes: are
>> there better tools outside of PostgreSQL for matching up these 46
>> million strings, growing at a rate of 1 million every 3 days? We
>> don't want to invest zillions of dollars in hardware, but if we have
>> to we will. I just want to make sure we have covered all the
>> non-hardware possibilities for improvement before we start investing
>> in large disk arrays.
>> Thanks.
>>
>> --sean
>>
>
