Quick Links

Re: Looking for ideas on how to speed up warehouse loading

From:	Sean Shanny <shannyconsulting(at)earthlink(dot)net>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Looking for ideas on how to speed up warehouse loading
Date:	2004-04-22 23:30:53
Message-ID:	408855AD.4090500@earthlink.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

I should have included this as well:

Sean Shanny wrote:

> To all,
>
> Essentials: Running 7.4.1 on OSX on a loaded G5 with dual procs, 8GB
> memory, direct attached via fibre channel to a fully optioned 3.5TB
> XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
>
> Background: We are loading what are essentially xml based access logs
> from about 20+ webservers daily, about 6GB of raw data. We have a
> classic star schema. All the ETL tools are custom java code or
> standard *nix tools like sort, uniq etc...
>
> The problem: We have about 46 million rows in a table with the
> following schema:
>
> Table "public.d_referral"
> Column | Type | Modifiers
> --------------------+---------+-----------
> id | integer | not null
> referral_raw_url | text | not null
> job_control_number | integer | not null
> Indexes:
> "d_referral_pkey" primary key, btree (id)
> "idx_referral_url" btree (referral_raw_url)
>
> This is one of our dimension tables. Part of the daily ETL process is
> to match all the new referral URL's against existing data in the
> d_referral table. Some of the values in referral_raw_url can be 5000
> characters long :-( . The avg length is : 109.57 characters.
>
> I sort and uniq all the incoming referrals and load them into a temp
> table.
>
> Table "public.referral_temp"
> Column | Type | Modifiers
> --------+------+-----------
> url | text | not null
> Indexes:
> "referral_temp_pkey" primary key, btree (url)
>
> I then do a left join
>
> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral
> t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
>
> This is the output from an explain analyze (Please note that I do a
> set enable_index_scan = false prior to issuing this because it takes
> forever using indexes.):
>
> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Sort (cost=4012064.81..4013194.45 rows=451856 width=115) (actual
> time=1297320.823..1297739.813 rows=476176 loops=1)
> Sort Key: t1.id
> -> Hash Left Join (cost=1052345.95..3969623.10 rows=451856
> width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
> Hash Cond: ("outer".url = "inner".referral_raw_url)
> -> Seq Scan on referral_temp t2 (cost=0.00..6645.56
> rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
> -> Hash (cost=729338.16..729338.16 rows=46034716 width=124)
> (actual time=1146440.710..1146440.710 rows=0 loops=1)
> -> Seq Scan on d_referral t1 (cost=0.00..729338.16
> rows=46034716 width=124) (actual time=14.502..-1064277.123
> rows=46034715 loops=1)
> Total runtime: 1298153.193 ms
> (8 rows)
>
>
>
> What I would like to know is if there are better ways to do the join?
> I need to get all the rows back from the referral_temp table as they
> are used for assigning FK's for the fact table later in processing.
> When I iterate over the values that I get back those with t1.id = null
> I assign a new FK and push both into the d_referral table as new
> entries as well as a text file for later use. The matching records
> are written to a text file for later use.
> If we cannot improve the join performance my question becomes are
> there better tools to match up the 46 million and growing at the rate
> of 1 million every 3 days, strings outside of postgresql? We don't
> want to have to invest in zillions of dollars worth of hardware but if
> we have to we will. I just want to make sure we have all the non
> hardware possibilities for improvement covered before we start
> investing in large disk arrays.
> Thanks.
>
> --sean
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
> your
> joining column's datatypes do not match
>

In response to

Looking for ideas on how to speed up warehouse loading at 2004-04-22 21:56:59 from Sean Shanny

Responses

Re: Looking for ideas on how to speed up warehouse loading at 2004-04-23 00:54:15 from Nicholas Shanny
Re: Looking for ideas on how to speed up warehouse loading at 2004-04-23 16:25:01 from scott.marlowe

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Nicholas Shanny	2004-04-23 00:54:15	Re: Looking for ideas on how to speed up warehouse loading
Previous Message	CoL	2004-04-22 23:05:20	Re: Looking for ideas on how to speed up warehouse loading