Re: SELECT DISTINCT performance issue

From: K C Lau <kclau60(at)netvigator(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: SELECT DISTINCT performance issue
Date: 2005-06-06 04:52:31
Message-ID: 6.2.1.2.0.20050606121632.02c346d0@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

At 13:06 05/06/05, Tom Lane wrote:
>K C Lau <kclau60(at)netvigator(dot)com> writes:
> > esdt=> explain analyze select DISTINCT ON (PlayerID) PlayerID,AtDate from
> > Player
> > where PlayerID='22220' order by PlayerID desc, AtDate desc;
> > Unique (cost=0.00..2507.66 rows=1 width=23) (actual time=0.000..187.000
> > rows=1 loops=1)
> > -> Index Scan Backward using pk_player on player (cost=0.00..2505.55
> > rows=8
> > 43 width=23) (actual time=0.000..187.000 rows=1227 loops=1)
> > Index Cond: ((playerid)::text = '22220'::text)
> > Total runtime: 187.000 ms
>
> > It appears that all the 1227 data records for that player were searched,
> > even when doing a backward index scan. I would presume that, after
> locating
> > the index for the highest AtDate, only the first data record needs to be
> > retrieved.
>
>If you'd said LIMIT 1, it indeed would have stopped sooner. Since you
>did not, it had to scan for more outputs.
>
> regards, tom lane

I added LIMIT 1 to the query and it worked fine for a direct query on the
table.

However, our system uses views and joined views extensively and I still got
the performance problem after many tries and finally modifying the view to:

create or replace view VCurPlayer (...) as
select a.... from Player a where a.AtDate =
(select b.AtDate from Player b where b.PlayerID = a.PlayerID and b.AtDate =
(select DISTINCT ON (c.PlayerID) c.AtDate from Player c
where c.PlayerID = b.PlayerID and c.PlayerID = a.PlayerID
order by c.PlayerID desc, c.AtDate desc LIMIT 1)
order by b.PlayerID desc, b.AtDate desc LIMIT 1);

esdt=> explain analyze select PlayerID,AtDate,version from VCurPlayer
where PlayerID='22220';

Index Scan using pk_player on player a (cost=0.00..3969606927.13 rows=59
width
=27) (actual time=328.000..328.000 rows=1 loops=1)
Index Cond: ((playerid)::text = '22220'::text)
Filter: ((atdate)::text = ((subplan))::text)
SubPlan
-> Limit (cost=0.00..337263.45 rows=1 width=23) (actual
time=0.180..0.180
rows=1 loops=1743)
-> Index Scan Backward using pk_player on player
b (cost=0.00..1652
5909.12 rows=49 width=23) (actual time=0.180..0.180 rows=1 loops=1743)
Index Cond: ((playerid)::text = ($1)::text)
Filter: ((atdate)::text = ((subplan))::text)
SubPlan
-> Limit (cost=0.00..1697.53 rows=1 width=23) (actual
time=
0.072..0.072 rows=1 loops=1743)
-> Unique (cost=0.00..1697.53 rows=1 width=23)
(actua
l time=0.063..0.063 rows=1 loops=1743)
-> Index Scan Backward using pk_player on
player
c (cost=0.00..1695.32 rows=885 width=23) (actual time=0.063..0.063
rows=1 loop
s=1743)
Index Cond: (((playerid)::text =
($0)::text
) AND ((playerid)::text = ($1)::text))
Total runtime: 328.000 ms

Two subqueries are needed, whereas a single subquery would return no rows.

Please note that when we first ported our system to PostgreSQL 7.1.1 a few
years ago, we used LIMIT 1 (without DISTINCT) and we did not encounter this
performance problem. Our client is currently using MS Windows (using SELECT
TOP 1 construct) with no such problem either.

Any suggestions?

Regards, KC.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message John Sidney-Woollett 2005-06-06 07:13:10 Download mirrors not found?
Previous Message Michael Fuhr 2005-06-06 04:50:41 Re: survey of the postgresql communiity