Re: Picking out the most recent row using a time stamp column

From: Dave Johansen <davejohansen(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Picking out the most recent row using a time stamp column
Date: 2013-04-05 16:54:06
Message-ID: CAAcYxUcJpVCVgxo4mdq6LOyCagVA80oXEpYD0pg0dULeDpf0Tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Sat, Feb 26, 2011 at 2:38 PM, Dave Johansen <davejohansen(at)gmail(dot)com>
wrote:
>
> Unfortunately, I'm running 8.3.3 and to my knowledge the windowing stuff
> wasn't added til 8.4.
> Dave
>
> On Feb 26, 2011 2:06 PM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
> > Dave,
> >
> > Why not test the windowing version I posted?

We finally have moved over to 8.4 and so I just wanted to post the
time comparison numbers to show the times on 8.4 as well. This is also
a newer data set with ~700k rows and ~4k distinct id_key values.

1) Dependent subquery
SELECT a.id_key, a.time_stamp, a.value FROM data AS a WHERE
a.time_stamp = (SELECT MAX(time_stamp) FROM data AS b WHERE a.id_key =
b.id_key);
8.3.3: Killed it after a few minutes
8.4.13: Killed it after a few minutes

2) Join against temporary table
SELECT a.id_key, a.time_stamp, a.value FROM data AS a JOIN (SELECT
id_key, MAX(time_stamp) AS max_time_stamp FROM data GROUP BY id_key)
AS b WHERE a.id_key = b.id_key AND a.time_stamp = b.max_time_stamp;
8.3.3: 1.4 s
8.4.13: 0.5 s

3) DISTINCT ON:
SELECT DISTINCT ON (id_key) id_key, time_stamp, value FROM data ORDER
BY id_key, time_stamp DESC;
Without Index:
8.3.3: 34.1 s
8.4.13: 98.7 s
With Index (data(id_key, time_stamp DESC)):
8.3.3: 3.4 s
8.4.13: 1.3 s

4) Auto-populated table
SELECT id_key, time_stamp, value FROM data WHERE rid IN (SELECT rid
FROM latestdata);
8.3.3: 0.2 s
8.4.13: 0.06 s

5) Windowing
SELECT id_key, time_stamp, value FROM (SELECT id_key, time_stamp,
value, row_number() OVER (PARTITION BY id_key ORDER BY time_stamp
DESC) AS ranking FROM data) AS a WHERE ranking=1;
8.3.3: N/A
8.4.13: 1.6 s

So the auto-populated table (#4) is the fastest by an order of
magnitude, but the join against the temporary table (#2) is the next
best option based on speed and doesn't require the extra multi-column
index that DISTINCT ON (#3) does.

On a related note though, is there a way to make the multi-column
index used in the DISTINCT ON more efficient. Based on the results, it
appears that the multi-column index is actually a single index with
the ordering of the tree based on the first value and then the second
value. Is there a way to make it be a "multi-level index"? What I mean
is that the first value is basically a tree/hash that then points to
the second index because if that's possible then that would probably
make the DISTINCT ON (#3) version as fast or faster than the
auto-populated table (#4). Is there a way to create an index like that
in postgres?

Thanks,
Dave

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Merlin Moncure 2013-04-05 18:40:16 Re: Picking out the most recent row using a time stamp column
Previous Message Kevin Grittner 2013-04-05 16:37:26 Re: INDEX Performance Issue