Including unique users in huge data warehouse in Postgresql...

From: Mark Jensen <musicnyman1974(at)yahoo(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: mark jensen <musicnyman1974(at)yahoo(dot)com>
Subject: Including unique users in huge data warehouse in Postgresql...
Date: 2006-11-28 17:55:44
Message-ID: 20061128175544.65232.qmail@web82108.mail.mud.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

So i've been given the task of designing a data warehouse in
either Postgresql or Mysql for our clickstream data for our sites. I
started with Mysql but the joins in Mysql are just way too slow
compared to Postgresql when playing with star schemas. I can't say
which sites i'm working on, but we get close to 3-5 million uniques
users per day, so over time, that's a lot of unique users to keep
around and de-dup your fact tables by. Need to be able to query normal
analytics like:

- Unique users per time (day/month/week/ etc)
- Unique users per area of the site
-
Unique users by anything really that we have in the dimension tables
and that we know about. it's all about unique users at all times with
them. so any report i do, unique users are usually included in
them. so it's hard to keep summary tables around since it's mostly
adhoc from the raw fact tables.

Let me just get to the
point. I'm wondering if anyone has had any experience doing a
clickstream data warehouse in postgresql. Only way I can think of
doing this is to keep a user id in the fact table that you would run a
count(distinct(uu_id)) on to get the number of unique users for that
query. so to keep this simple, this is what i have so far, and the
fact table is probably close to 1 billion rows for about 20 days of
data. but of course, this takes forever when you want to dedup by
uu_id by each element. i can include visits in there as well, but most
of the sales/business guys only really want unique users. it's pretty
fast when you query against one time_id or hour/day, but when you have
to scan the whole fact table to get LTV so far, it's crazy.

i've
made a lot of optimizations in postgresql.conf by playing with work_mem
and shared_buffers and such and i think the database is using as much
as it can disk/memory/cpu wise. also vacuuming the tables as much as
possible. just wondering if anyone had any suggestions or could point
out anything i could be doing wrong, or make it better to get at
uniques. doing simple queries by not including uu_id (uniques) is
pretty fast and that's no problem. I've also gotten 3 books from
Kimball about data warehousing including the clickstream one. i've
also tried bizgres version of postgresql using bitmap indexes, but
didn't see a huge difference for what i need, so i'm back to using the
new beta3 of postgresql right now, since i love the new copy command
you can include queries in.

here's a sample query that takes a while to run... just a simple report that shows gender by area of the site.

select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as users
from uus as A, areas as B, daily_area_fact as C
where A.uu_id = C.uu_id
and B.area_id = C.area_id
group by gender,area;

so
by just having one day of data, with 3,168,049 rows in the user
dimension table (uus), 17,213,420 in the daily_area_fact table that
joins all the dimension tables, takes about 15 minutes. if i had 30-90
days in this fact table, who knows how long this would take... i know
doing a distinct on uu_id is very expensive, so that's the main problem
here i guess and would want to know if anyone else is doing it this way
or better.

Total query runtime: 878595 ms.
Data retrieval runtime: 1361 ms.
163 rows retrieved.

here's the explain:

QUERY PLAN
------------------------------------------------------------------------------------------------------
GroupAggregate (cost=3846163.34..4104788.14 rows=24200 width=33)
-> Sort (cost=3846163.34..3889196.89 rows=17213420 width=33)
Sort Key: a.gender, b.area
-> Hash Join (cost=132111.19..1306685.84 rows=17213420 width=33)
Hash Cond: (c.uu_id = a.uu_id)
-> Hash Join (cost=86.58..701292.17 rows=17213420 width=25)
Hash Cond: (c.area_id = b.area_id)
-> Seq Scan on daily_area_fact c (cost=0.00..356937.20 rows=17213420 width=16)
-> Hash (cost=78.26..78.26 rows=3326 width=17)
-> Seq Scan on areas b (cost=0.00..78.26 rows=3326 width=17)
-> Hash (cost=124104.49..124104.49 rows=3168049 width=12)
-> Seq Scan on uus a (cost=0.00..124104.49 rows=3168049 width=12)
(12 rows)

So here is a snapshot of my user dimension table for each user (uus table)

dw_big=# \d uus
Table "public.uus"
Column | Type | Modifiers
--------------+------------------------+-----------------------------------------------------
uu_id | integer | not null default nextval('uus_uu_id_seq'::regclass)
uu | character(50) | not null
imps | integer |
clicks | integer |
gca_clicks | integer |
convs | integer |
imp_rev | numeric(10,6) |
click_rev | numeric(10,6) |
total_rev | numeric(10,6) |
geo_id | integer |
reg | integer |
usernum | integer |
gender | character(1) |
age | integer |
age_grp | character(10) |
reg_date | date |
vis | integer |
first_date | date |
first_hour | integer |
last_date | date |
utm_campaign | character varying(100) |
utm_medium | character varying(100) |
utm_source | character varying(100) |
utm_content | character varying(255) |
utm_keyword | character varying(255) |
Indexes:
"uus_pkey" PRIMARY KEY, btree (uu)
"uus_geo_id_index" btree (geo_id)
"uus_usernum_id_index" btree (usernum)
"uus_uu_id_index" btree (uu_id)

here is my "area" of the site dimension table (areas table)

dw_big=# \d areas;
Table "public.areas"
Column | Type | Modifiers
---------+------------------------+---------------------------------------------------------
area_id | integer | not null default nextval('areas_area_id_seq'::regclass)
site | character varying(100) | not null
area | character varying(100) | not null
subarea | character varying(100) | not null
size | character varying(50) | not null
pos | character varying(50) | not null
Indexes:
"areas_pkey" PRIMARY KEY, btree (site, subarea, area, size, pos)
"areas_area_id_index" btree (area_id)

and
here is my huge fact table which joins all the dimension tables
together since they want to be able to query, cut, and slice anything
by anything at any time:

dw_big=# \d daily_area_fact;
Table "public.daily_area_fact"
Column | Type | Modifiers
-----------+---------+-----------
time_id | integer | not null
ccid | integer | not null
area_id | integer | not null
uu_id | integer | not null
geo_id | integer | not null
imps | integer |
clicks | integer |
imp_rev | numeric |
click_rev | numeric |
total_rev | numeric |
Indexes:
"daily_area_fact_pkey" PRIMARY KEY, btree (time_id, ccid, area_id, uu_id, geo_id)


____________________________________
Mark Jensen

Responses

Browse pgsql-general by date

  From Date Subject
Next Message novnov 2006-11-28 18:01:06 Re: Editing contrib modules which are loaded by default?
Previous Message Albe Laurenz 2006-11-28 17:50:34 Re: AIX and getaddrinfo (was Re: Upgrade problem)