Re: fast DISTINCT or EXIST

From: Tilo Buschmann <mailinglist(dot)postgresql(dot)performance(at)b-n-w(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Arjen van der Meijden <acmmailing(at)tweakers(dot)net>, PostgreSQL Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: fast DISTINCT or EXIST
Date: 2007-04-07 16:24:07
Message-ID: 20070407182407.297e7e35@wonderland
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi everyone,

On Sat, 07 Apr 2007 11:54:08 -0400
Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Arjen van der Meijden <acmmailing(at)tweakers(dot)net> writes:
> > If that is your main culprit, you could also use two limits based on the
> > fact that there will be at most X songs per cd which would match your
> > title (my not very educated guess is 3x). Its a bit ugly... but if that
> > is what it takes to make postgresql not scan your entire index, so be it...
>
> > SELECT ... FROM cd
> > JOIN tracks ...
> > WHERE cd.id IN (SELECT DISTINCT cd_id FROM (SELECT t.cd_id FROM tracks t
> > WHERE t.tstitle @@ plainto_tsquery('simple','education') LIMIT 30)
> > as foo LIMIT 10)
>
> I think that's the only way. There is no plan type in Postgres that
> will generate unique-ified output without scanning the whole input
> first, except for Uniq on pre-sorted input, which we can't use here
> because the tsearch scan isn't going to deliver the rows in cd_id order.

Unfortunately, the query above will definitely not work correctly, if
someone searches for "a" or "the".

The correct query does not perform as well as I hoped.

#v+
cddb=# EXPLAIN ANALYSE SELECT cd.cd_id,cd.artist,cd.title,tracks.title FROM cd JOIN tracks USING (cd_id) WHERE cd_id IN (SELECT DISTINCT tracks.cd_id FROM tracks WHERE tracks.tstitle @@ plainto_tsquery('simple','sympathy') LIMIT 10);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=61031.41..64906.58 rows=139 width=69) (actual time=31236.562..31810.940 rows=166 loops=1)
-> Nested Loop (cost=61031.41..61176.20 rows=10 width=50) (actual time=31208.649..31388.289 rows=10 loops=1)
-> Limit (cost=61031.41..61089.74 rows=10 width=4) (actual time=31185.972..31186.024 rows=10 loops=1)
-> Unique (cost=61031.41..61124.74 rows=16 width=4) (actual time=31185.967..31186.006 rows=10 loops=1)
-> Sort (cost=61031.41..61078.07 rows=18665 width=4) (actual time=31185.961..31185.977 rows=11 loops=1)
Sort Key: public.tracks.cd_id
-> Bitmap Heap Scan on tracks (cost=536.76..59707.31 rows=18665 width=4) (actual time=146.222..30958.057 rows=1677 loops=1)
Recheck Cond: (tstitle @@ '''sympathy'''::tsquery)
-> Bitmap Index Scan on tstitle_tracks_idx (cost=0.00..532.09 rows=18665 width=0) (actual time=126.328..126.328 rows=1677 loops=1)
Index Cond: (tstitle @@ '''sympathy'''::tsquery)
-> Index Scan using cd_id_key on cd (cost=0.00..8.62 rows=1 width=46) (actual time=20.218..20.219 rows=1 loops=10)
Index Cond: (cd.cd_id = "IN_subquery".cd_id)
-> Index Scan using cdid_tracks_idx on tracks (cost=0.00..358.08 rows=1197 width=27) (actual time=39.935..42.247 rows=17 loops=10)
Index Cond: (cd.cd_id = public.tracks.cd_id)
Total runtime: 31811.256 ms
(15 rows)
#v-

It gets better when the rows are in memory (down to 10.452 ms), but
Murphy tells me, that the content that I need will never be in memory.

I think I disregarded this variant at first, because it limits the
possibility to restrict the cd artist and title.

> I can see how to build one: make a variant of HashAggregate that returns
> each input row immediately after hashing it, *if* it isn't a duplicate
> of one already in the hash table. But it'd be a lot of work for what
> seems a rather specialized need.

D'oh.

Actually, I hoped to find an alternative, that does not involve
DISTINCT.

Best Regards,

Tilo

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2007-04-07 16:39:38 Re: fast DISTINCT or EXIST
Previous Message Tom Lane 2007-04-07 15:54:08 Re: fast DISTINCT or EXIST