Re: efficient data reduction (and deduping)

From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: Craig James <cjames(at)emolecules(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: efficient data reduction (and deduping)
Date: 2012-03-01 19:30:06
Message-ID: CAAB3BBJNq_utncTKD1NGVfNWTkYmaXgmG269A1+M0QTobca9fw@mail.gmail.com
Lists: pgsql-performance

Hah! Yeah, that might work. Except that I suck at grep. :(
Perhaps that's a weakness I should remedy.

On Thu, Mar 1, 2012 at 10:35 AM, Craig James <cjames(at)emolecules(dot)com> wrote:

> On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
> <alessandro(at)path(dot)com> wrote:
> > Hi folks,
> >
> > I have a system that racks up about 40M log lines per day. I'm able to COPY
> > the log files into a PostgreSQL table that looks like this:
>
> Since you're using a COPY command and the table has a simple column
> with exactly the value you want, why not filter it using grep(1) or
> something similar and load the filtered result directly into the
> hourly table?
>
> Craig
>
> >
> > CREATE TABLE activity_unlogged
> > (
> > user_id character(24) NOT NULL,
> > client_ip inet,
> > hr_timestamp timestamp without time zone,
> > locale character varying,
> > log_id character(36),
> > method character varying(6),
> > server_ip inet,
> > uri character varying,
> > user_agent character varying
> > )
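
(For reference, the load step for a table like this can be a plain COPY per log
file; the file path below is a placeholder, and the WITH (FORMAT csv) option
list assumes PostgreSQL 9.0 or later.)

-- Hypothetical load step; adjust the path and format to match how the log
-- files are actually written.
COPY activity_unlogged (user_id, client_ip, hr_timestamp, locale, log_id,
                        method, server_ip, uri, user_agent)
FROM '/var/log/app/activity-2012-03-01.csv'
WITH (FORMAT csv);
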
> >
> > Now, I want to reduce that data to get the last activity that was performed
> > by each user in any given hour. It should fit into a table like this:
> >
> > CREATE TABLE hourly_activity
> > (
> > activity_hour timestamp without time zone NOT NULL,
> > user_id character(24) NOT NULL,
> > client_ip inet,
> > hr_timestamp timestamp without time zone,
> > locale character varying,
> > log_id character(36),
> > method character varying(6),
> > server_ip inet,
> > uri character varying,
> > user_agent character varying,
> > CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour , user_id )
> > )
> >
> > where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the primary
> > key constraint)
> >
> > I am attempting to do that with the following:
> >
> > INSERT INTO hourly_activity
> > SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
> > activity_unlogged.user_id,
> > client_ip, hr_timestamp, locale, log_id, method,
> > server_ip, uri, user_agent
> > FROM activity_unlogged,
> >   (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
> >    FROM activity_unlogged
> >    GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity
> > WHERE activity_unlogged.user_id = last_activity.user_id AND
> > activity_unlogged.hr_timestamp = last_activity.last_timestamp;
> >
> > I have two problems:
> >
> > It's incredibly slow (like: hours). I assume this is because I am scanning
> > through a huge unindexed table twice. I imagine there is a more efficient
> > way to do this, but I can't think of what it is. If I were doing this in a
> > procedural programming language, it might look something like:
> >
> > for row in activity_unlogged:
> >     if (date_trunc('hour', hr_timestamp), user_id) in hourly_activity[(activity_hour, user_id)]:
> >         if hr_timestamp > hourly_activity[(date_trunc('hour', hr_timestamp), user_id)][hr_timestamp]:
> >             hourly_activity <- row # UPDATE
> >     else:
> >         hourly_activity <- row # INSERT
> >
> > I suspect some implementation of this (hopefully my pseudocode is at least
> > somewhat comprehensible) would be very slow as well, but at least it would
> > only go through activity_unlogged once. (Then again, it would have to
> > rescan hourly_activity each time, so it really wouldn't be any faster at
> > all, would it?) I feel like there must be a more efficient way to do this
> > in SQL though I can't put my finger on it.
> > Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE
> > activity_unlogged.user_id = last_activity.user_id AND
> > activity_unlogged.hr_timestamp = last_activity.last_timestamp leads to
> > multiple records leading to a primary key collision. In such cases, I don't
> > really care which of the two rows are picked, I just want to make sure that
> > no more than one row is inserted per user per hour. In fact, though I would
> > prefer to get the last row for each hour, I could probably get much the same
> > effect if I just limited it to one per hour. Though I don't know if that
> > really helps at all.
>
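
For what it's worth, both problems can be attacked at once with PostgreSQL's
DISTINCT ON, which keeps exactly one row per group and lets the ORDER BY decide
which row that is. A rough, untested sketch against the schema quoted above; it
should need only a single scan of activity_unlogged plus a sort, and it cannot
produce primary key collisions because only one row per (hour, user) group
survives:

-- One pass over activity_unlogged: DISTINCT ON keeps the first row of each
-- (hour, user_id) group, and the trailing hr_timestamp DESC in the ORDER BY
-- makes that first row the latest one in the hour. Ties between identical
-- timestamps are broken arbitrarily, which matches the "I don't care which
-- row wins" requirement.
INSERT INTO hourly_activity
SELECT DISTINCT ON (date_trunc('hour', hr_timestamp), user_id)
       date_trunc('hour', hr_timestamp) AS activity_hour,
       user_id,
       client_ip, hr_timestamp, locale, log_id, method,
       server_ip, uri, user_agent
FROM activity_unlogged
ORDER BY date_trunc('hour', hr_timestamp), user_id, hr_timestamp DESC;

Note that the expressions after DISTINCT ON must match the leftmost ORDER BY
expressions; only the extra hr_timestamp DESC at the end is free to vary.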
