Re: Picking out the most recent row using a time stamp column

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Dave Johansen <davejohansen(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2013-04-05 18:40:16
Message-ID: CAHyXU0zrGJzr-R1rOqA=7m7k6BOdxauvbsfPYSM9+iaQELaK3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Fri, Apr 5, 2013 at 11:54 AM, Dave Johansen <davejohansen(at)gmail(dot)com> wrote:
> On Sat, Feb 26, 2011 at 2:38 PM, Dave Johansen <davejohansen(at)gmail(dot)com>
> wrote:
>>
>> Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
>> wasn't added til 8.4.
>> Dave
>>
>> On Feb 26, 2011 2:06 PM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>> > Dave,
>> >
>> > Why not test the windowing version I posted?
>
> We finally have moved over to 8.4 and so I just wanted to post the
> time comparison numbers to show the times on 8.4 as well. This is also
> a newer data set with ~700k rows and ~4k distinct id_key values.
>
> 1) Dependent subquery
> SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
> a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
> b.id_key);
> 8.3.3: Killed it after a few minutes
> 8.4.13: Killed it after a few minutes
>
> 2) Join against temporary table
> SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
> id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
> AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
> 8.3.3: 1.4 s
> 8.4.13: 0.5 s
>
> 3) DISTINCT ON:
> SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
> BY id_key, time_stamp DESC;
> Without Index:
> 8.3.3: 34.1 s
> 8.4.13: 98.7 s
> With Index (data(id_key, time_stamp DESC)):
> 8.3.3: 3.4 s
> 8.4.13: 1.3 s
>
> 4) Auto-populated table
> SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
> FROM latestdata);
> 8.3.3: 0.2 s
> 8.4.13: 0.06 s
>
> 5) Windowing
> SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
> value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
> DESC) AS ranking FROM data) AS a WHERE ranking=1;
> 8.3.3: N/A
> 8.4.13: 1.6 s

I would also test:

*) EXISTS()

SELECT a.id_key, a.time_stamp, a.value FROM data
WHERE NOT EXISTS
(
SELECT 1 FROM data b
WHERE
a.id_key = b.id_key
and b.time_stamp > a.time_stamp
);

*) custom aggregate (this will not be the fastest option but is a good
technique to know -- it can be a real life saver when selection
criteria is complex)

CREATE FUNCTION agg_latest_data(data, data) returns data AS
$$
SELECT CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
$$ LANGUAGE SQL IMMUTABLE;

CREATE AGGREGATE latest_data (
SFUNC=agg_latest_data,
STYPE=data
);

SELECT latest_data(d) FROM data d group by d.id_key;

the above returns the composite, not the fields, but that can be worked around.

merlin

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Joe Van Dyk 2013-04-06 01:38:01 slow joins?
Previous Message Dave Johansen 2013-04-05 16:54:06 Re: Picking out the most recent row using a time stamp column