Re: Looking for ideas on how to speed up warehouse loading

From: Sean Shanny <shannyconsulting(at)earthlink(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Looking for ideas on how to speed up warehouse loading
Date: 2004-04-22 23:30:53
Message-ID: 408855AD.4090500@earthlink.net
Lists: pgsql-performance

I should have included this as well:

show all;
name | setting
--------------------------------+----------------
add_missing_from | on
australian_timezones | off
authentication_timeout | 60
check_function_bodies | on
checkpoint_segments | 64
checkpoint_timeout | 30
checkpoint_warning | 30
client_encoding | UNICODE
client_min_messages | notice
commit_delay | 0
commit_siblings | 5
cpu_index_tuple_cost | 0.001
cpu_operator_cost | 0.0025
cpu_tuple_cost | 0.01
DateStyle | ISO, MDY
db_user_namespace | off
deadlock_timeout | 1000
debug_pretty_print | off
debug_print_parse | off
debug_print_plan | off
debug_print_rewritten | off
default_statistics_target | 1000
default_transaction_isolation | read committed
default_transaction_read_only | off
dynamic_library_path | $libdir
effective_cache_size | 400000
enable_hashagg | on
enable_hashjoin | on
enable_indexscan | on
enable_mergejoin | on
enable_nestloop | on
enable_seqscan | on
enable_sort | on
enable_tidscan | on
explain_pretty_print | on
extra_float_digits | 0
from_collapse_limit | 8
fsync | on
geqo | on
geqo_effort | 1
geqo_generations | 0
geqo_pool_size | 0
geqo_selection_bias | 2
geqo_threshold | 11
join_collapse_limit | 8
krb_server_keyfile | unset
lc_collate | C
lc_ctype | C
lc_messages | C
lc_monetary | C
lc_numeric | C
lc_time | C
log_connections | off
log_duration | off
log_error_verbosity | default
log_executor_stats | off
log_hostname | off
log_min_duration_statement | -1
log_min_error_statement | panic
log_min_messages | notice
log_parser_stats | off
log_pid | off
log_planner_stats | off
log_source_port | off
log_statement | off
log_statement_stats | off
log_timestamp | on
max_connections | 100
max_expr_depth | 10000
max_files_per_process | 1000
max_fsm_pages | 20000
max_fsm_relations | 1000
max_locks_per_transaction | 64
password_encryption | on
port | 5432
pre_auth_delay | 0
preload_libraries | unset
random_page_cost | 4
regex_flavor | advanced
rendezvous_name | unset
search_path | $user,public
server_encoding | UNICODE
server_version | 7.4.1
shared_buffers | 4000
silent_mode | off
sort_mem | 64000
sql_inheritance | on
ssl | off
statement_timeout | 0
stats_block_level | on
stats_command_string | on
stats_reset_on_server_start | off
stats_row_level | on
stats_start_collector | on
superuser_reserved_connections | 2
syslog | 0
syslog_facility | LOCAL0
syslog_ident | postgres
tcpip_socket | on
TimeZone | unknown
trace_notify | off
transaction_isolation | read committed
transaction_read_only | off
transform_null_equals | off
unix_socket_directory | unset
unix_socket_group | unset
unix_socket_permissions | 511
vacuum_mem | 64000
virtual_host | unset
wal_buffers | 1024
wal_debug | 0
wal_sync_method | open_sync
zero_damaged_pages | off

Sean Shanny wrote:

> To all,
>
> Essentials: Running 7.4.1 on OS X on a loaded G5 with dual procs, 8GB
> of memory, direct-attached via Fibre Channel to a fully optioned 3.5TB
> XRaid box (14 spindles, 2 sets of 7 in RAID 5) running RAID 50.
>
> Background: We are loading what are essentially XML-based access logs
> from 20+ webservers daily, about 6GB of raw data. We have a
> classic star schema. All the ETL tools are custom Java code or
> standard *nix tools like sort, uniq, etc.
>
> The problem: We have about 46 million rows in a table with the
> following schema:
>
> Table "public.d_referral"
> Column | Type | Modifiers
> --------------------+---------+-----------
> id | integer | not null
> referral_raw_url | text | not null
> job_control_number | integer | not null
> Indexes:
> "d_referral_pkey" primary key, btree (id)
> "idx_referral_url" btree (referral_raw_url)
>
> This is one of our dimension tables. Part of the daily ETL process is
> to match all the new referral URLs against the existing data in the
> d_referral table. Some of the values in referral_raw_url can be 5,000
> characters long :-( . The average length is 109.57 characters.
>
> I sort and uniq all the incoming referrals and load them into a temp
> table (a rough sketch of that load step follows the table definition
> below).
>
> Table "public.referral_temp"
> Column | Type | Modifiers
> --------+------+-----------
> url | text | not null
> Indexes:
> "referral_temp_pkey" primary key, btree (url)
>
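(Concretely, that load step looks roughly like the following; the file
name here is made up and the exact options vary, so treat this as a
sketch rather than the real ETL code:)

-- load the deduplicated referral URLs into the temp table
TRUNCATE referral_temp;
COPY referral_temp (url) FROM '/tmp/referrals_uniq.txt';
-- refresh stats so the planner sees the new row count
ANALYZE referral_temp;
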
> I then do a left join:
>
> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral
> t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
>
> This is the output from an EXPLAIN ANALYZE (please note that I do a
> SET enable_indexscan = false prior to issuing this because it takes
> forever using indexes; the exact session setup is also sketched after
> the plan output below):
>
> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
> Sort (cost=4012064.81..4013194.45 rows=451856 width=115) (actual time=1297320.823..1297739.813 rows=476176 loops=1)
>   Sort Key: t1.id
>   ->  Hash Left Join (cost=1052345.95..3969623.10 rows=451856 width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
>         Hash Cond: ("outer".url = "inner".referral_raw_url)
>         ->  Seq Scan on referral_temp t2 (cost=0.00..6645.56 rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
>         ->  Hash (cost=729338.16..729338.16 rows=46034716 width=124) (actual time=1146440.710..1146440.710 rows=0 loops=1)
>               ->  Seq Scan on d_referral t1 (cost=0.00..729338.16 rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715 loops=1)
> Total runtime: 1298153.193 ms
> (8 rows)
>
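(Session setup to reproduce the plan above; note the parameter is
spelled enable_indexscan, as in the "show all" output at the top of
this message. Sketch only:)

SET enable_indexscan = off;
EXPLAIN ANALYZE
SELECT t1.id, t2.url
FROM referral_temp t2
LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
ORDER BY t1.id;
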
>
>
> What I would like to know is whether there are better ways to do the
> join. I need to get all the rows back from the referral_temp table,
> as they are used for assigning FKs for the fact table later in
> processing. When I iterate over the values that come back, those with
> t1.id = null get assigned a new FK and are pushed into the d_referral
> table as new entries, as well as into a text file for later use; the
> matching records are written to a text file for later use. (A rough
> SQL equivalent of the new-FK step is sketched at the end of this
> post.)
>
> If we cannot improve the join performance, my question becomes: are
> there better tools for matching up these 46 million strings (growing
> at a rate of 1 million every 3 days) outside of PostgreSQL? We don't
> want to invest in zillions of dollars' worth of hardware, but if we
> have to, we will. I just want to make sure we have covered all the
> non-hardware possibilities for improvement before we start investing
> in large disk arrays.
> Thanks.
>
> --sean
>
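
For what it's worth, the "assign a new FK to the unmatched rows" step
described above is done outside the database in our ETL code, but a
rough SQL equivalent would look like the sketch below. The sequence
name and the job_control_number literal are made up for illustration:

-- Illustrative only: insert unmatched URLs into d_referral with
-- freshly assigned ids. d_referral_id_seq and the constant 42 are
-- hypothetical; the real assignment happens in the external ETL code.
INSERT INTO d_referral (id, referral_raw_url, job_control_number)
SELECT nextval('d_referral_id_seq'), t2.url, 42
FROM referral_temp t2
LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
WHERE t1.id IS NULL;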
