Skip site navigation (1) Skip section navigation (2)

Re: optimized counting of web statistics

From: Matthew Nuzum <mattnuzum(at)gmail(dot)com>
To: Billy extyeightysix <exty86(at)gmail(dot)com>
Cc: Postgresql Performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: optimized counting of web statistics
Date: 2005-06-28 21:55:44
Message-ID: f3c0b408050628145545699c4e@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-performance
On 6/28/05, Billy extyeightysix <exty86(at)gmail(dot)com> wrote:
> Hola folks,
> 
> I have a web statistics Pg database (user agent, urls, referrer, etc)
> that is part of an online web survey system. All of the data derived
> from analyzing web server logs is stored in one large table with each
> record representing an analyzed webserver log entry.
> 
> Currently all reports are generated when the logs are being analyzed
> and before the data ever goes into the large table I mention above.
> Well, the time has come to build an interface that will allow a user
> to make ad-hoc queries against the stats and that is why I am emailing
> the performance list.

Load your data into a big table, then pre-process into additional
tables that have data better organized for running your reports.

For example, you may want a table that shows a sum of all hits for
each site, for each hour of the day. You may want an additional table
that shows the sum of all page views, or maybe sessions for each site
for each hour of the day.

So, if you manage a single site, each day you will add 24 new records
to the sum table.

You may want the following fields:
site (string)
atime (timestamptz)
hour_of_day (int)
day_of_week (int)
total_hits (int8)

A record may look like this:
site | atime | hour_of_day | day_of_week | total_hits
'www.yoursite.com'  '2005-06-28 16:00:00 -0400'  18  2  350

Index all of the fields except total_hits (unless you want a report
that shows all hours where hits were greater than x or less than x).

Doing:
select sum(total_hits) as total_hits from summary_table where atime
between now() and (now() - '7 days'::interval);
should be pretty fast.

You can also normalize your data such as referrers, user agents, etc
and create similar tables to the above.

In case you haven't guessed, I've already done this very thing.

I do my batch processing daily using a python script I've written. I
found that trying to do it with pl/pgsql took more than 24 hours to
process 24 hours worth of logs. I then used C# and in memory hash
tables to drop the time to 2 hours, but I couldn't get mono installed
on some of my older servers. Python proved the fastest and I can
process 24 hours worth of logs in about 15 minutes. Common reports run
in < 1 sec and custom reports run in < 15 seconds (usually).
-- 
Matthew Nuzum
www.bearfruit.org

In response to

Responses

pgsql-performance by date

Next:From: Jean-Max ReymondDate: 2005-06-28 22:10:01
Subject: Re: perl garbage collector
Previous:From: Billy extyeightysixDate: 2005-06-28 20:43:46
Subject: Re: optimized counting of web statistics

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group