Skip site navigation (1) Skip section navigation (2)

efficient data reduction (and deduping)

From: Alessandro Gagliardi <alessandro(at)path(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: efficient data reduction (and deduping)
Date: 2012-03-01 18:27:27
Message-ID: CAAB3BBJQxQ1VVd-mub90a=2H6ezNXx1_bRL28VoxZH5EXWE0Lw@mail.gmail.com (view raw or flat)
Thread:
Lists: pgsql-performance
Hi folks,

I have a system that racks up about 40M log lines per day. I'm able to COPY
the log files into a PostgreSQL table that looks like this:

CREATE TABLE activity_unlogged
(
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character varying,
  log_id character(36),
  method character varying(6),
  server_ip inet,
  uri character varying,
  user_agent character varying
)

Now, I want to reduce that data to get the last activity that was performed
by each user in any given hour. It should fit into a table like this:

CREATE TABLE hourly_activity
(
  activity_hour timestamp without time zone NOT NULL,
  user_id character(24) NOT NULL,
  client_ip inet,
  hr_timestamp timestamp without time zone,
  locale character varying,
  log_id character(36),
  method character varying(6),
  server_ip inet,
  uri character varying,
  user_agent character varying,
  CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour , user_id )
)

where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the primary
key constraint)

I am attempting to do that with the following:

INSERT INTO hourly_activity
    SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
activity_unlogged.user_id,
                    client_ip, hr_timestamp, locale, log_id, method,
server_ip, uri, user_agent
        FROM activity_unlogged,
            (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
                FROM activity_unlogged GROUP BY user_id, date_trunc('hour',
hr_timestamp)) AS last_activity
    WHERE activity_unlogged.user_id = last_activity.user_id AND
activity_unlogged.hr_timestamp = last_activity.last_timestamp;

I have two problems:

   1. It's incredibly slow (like: hours). I assume this is because I am
   scanning through a huge unindexed table twice. I imagine there is a more
   efficient way to do this, but I can't think of what it is. If I were doing
   this in a procedural programming language, it might look something like:
   for row in activity_unlogged:
       if (date_trunc('hour', hr_timestamp), user_id) in
   hourly_activity[(activity_hour, user_id)]:
           if hr_timestamp > hourly_activity[(date_trunc('hour',
   hr_timestamp), user_id)][hr_timestamp]:
               hourly_activity <- row # UPDATE
       else:
           hourly_activity <- row # INSERT
   I suspect some implementation of this (hopefully my pseudocode is at
   least somewhat comprehensible) would be very slow as well, but at least it
   would only go through activity_unlogged once. (Then again, it would have
   to rescan hourly_activity each time, so it really wouldn't be any faster
   at all, would it?) I feel like there must be a more efficient way to do
   this in SQL though I can't put my finger on it.
   2. Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE
   activity_unlogged.user_id = last_activity.user_id AND
   activity_unlogged.hr_timestamp = last_activity.last_timestamp leads to
   multiple records leading to a primary key collision. In such cases, I don't
   really care which of the two rows are picked, I just want to make sure that
   no more than one row is inserted per user per hour. In fact, though I would
   prefer to get the last row for each hour, I could probably get much the
   same effect if I just limited it to one per hour. Though I don't know if
   that really helps at all.

Responses

pgsql-performance by date

Next:From: Claudio FreireDate: 2012-03-01 18:35:26
Subject: Re: efficient data reduction (and deduping)
Previous:From: Craig JamesDate: 2012-03-01 18:19:31
Subject: Re: [planner] Ignore "order by" in subselect if parrent do count(*)

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group