Re: in() VS exists()

From: Julien <jcigar(at)ulb(dot)ac(dot)be>
To: Bart Degryse <Bart(dot)Degryse(at)indicator(dot)be>
Cc: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: in() VS exists()
Date: 2008-03-13 16:50:41
Message-ID: 1205427041.3240.44.camel@frodon.be-bif.ulb.ac.be
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

>From my experience I tend to avoid SELECT DISTINCT queries because it's
usually very slow with many rows ...

For my specific case the result is the same:

muridae=> select count(distinct s.specimen_id) from specimens sp INNER
JOIN sequences s on s.specimen_id = sp.id;
count
-------
1431
(1 row)

Time: 65.351 ms
muridae=> select count(sp.id) from specimens sp where sp.id in (select
specimen_id from sequences group by specimen_id);
count
-------
1431
(1 row)

Time: 66.371 ms

But to give an example, I have a table with ~1 000 000 rows where the
DISTINCT solution is more than 10 times slower :

muridae=> select count(distinct sp.id) from specimens sp INNER JOIN
specimen_measurements m ON m.specimen_id = sp.id;
count
-------
75241
(1 row)

Time: 15970.668 ms

muridae=> select count(sp.id) from specimens sp INNER JOIN (select
specimen_id from specimen_measurements GROUP BY specimen_id) as foo on
foo.specimen_id = sp.id;
count
-------
75241
(1 row)

Time: 1165.487 ms

Regards,
Julien

On Thu, 2008-03-13 at 15:28 +0100, Bart Degryse wrote:
> how about
> select count(distinct s.specimen_id) from specimens sp INNER JOIN
> sequences s
> on s.specimen_id = sp.id;
>
>
> >>> Julien <jcigar(at)ulb(dot)ac(dot)be> 2008-03-13 17:27 >>>
> mmh no because it's a one to many relation (a specimen can have more
> than one sequence) :
>
> muridae=> select count(sp.id) from specimens sp INNER JOIN sequences s
> on s.specimen_id = sp.id;
> count
> -------
> 1536
> (1 row)
>
> Time: 81.242 ms
> muridae=> select count(sp.id) from specimens sp where sp.id in (select
> specimen_id from sequences group by specimen_id);
> count
> -------
> 1431
> (1 row)
>
> Time: 81.736 ms
> muridae=>
>
> (of course this is a bad example, because I could just do: select
> count(specimen_id) from sequences group by specimen_id;, but in my
> application I have more fields coming from specimens of course)
>
> Julien
>
> On Thu, 2008-03-13 at 15:12 +0100, Bart Degryse wrote:
> > I think that just
> > select count(sp.id) from specimens sp INNER JOIN sequences s on
> > s.specimen_id = sp.id;
> > should be enough
> >
> > >>> Julien <jcigar(at)ulb(dot)ac(dot)be> 2008-03-13 17:10 >>>
> > If I understood well the query plan, the planner optimize the
> > IN(SELECT ...) version with a JOIN (line 19-20 of the first paste) :
> >
> > -> Hash IN Join (cost=240.95..4011.20 rows=1436 width=4) (actual
> > time=93.971..201.908 rows=1431 loops=1)
> > Hash Cond: ("outer".id = "inner".specimen_id)
> >
> > so I guess that :
> >
> > select count(sp.id) from specimens sp where sp.id in (select
> > specimen_id
> > from sequences);
> >
> > is almost the same as :
> >
> > select count(sp.id) from specimens sp INNER JOIN (select specimen_id
> > from sequences GROUP BY specimen_id) as foo on foo.specimen_id =
> > sp.id;
> >
> > ?
> >
> > Thanks,
> > Julien
> >
> > On Thu, 2008-03-13 at 14:46 +0100, Bart Degryse wrote:
> > > The chapter on indexes in the manual should give you a pretty good
> > > idea on the why.
> > > IN and EXISTS are not the only possibilities, you can also use
> inner
> > > or outer joins.
> > > Which solution performs best depends on the data, the database
> > > version, the available indexes, ...
> > >
> > > >>> Julien <jcigar(at)ulb(dot)ac(dot)be> 2008-03-13 15:47 >>>
> > > Hello,
> > >
> > > Does anyone has an idea why sometimes:
> > > - select ... where ... in (select ...)
> > > is faster than :
> > > - select ... where ... exists(select ...)
> > > and sometimes it's the opposite ?
> > >
> > > I had such a situation, I've pasted the queries on:
> > > http://rafb.net/p/KXNZ6892.html and
> http://rafb.net/p/jvo5DO38.html
> > >
> > > It's running PostgreSQL 8.1 with an effective_cache_size of 30000.
> > >
> > > specimens.id is the primary key and there are indexes on
> > > sequences(specimen_id) and specimen_measurements(specimen_id)
> > >
> > > Is there a general "rule" to know when to use the in() version and
> > > when
> > > to use the exists() version ? Is it true to say that the exists()
> > > version is more scalable (with many rows) than the in() version
> > (from
> > > the little tests I made it seems the case) ?
> > >
> > > Thanks,
> > > Julien
> > >
> > > --
> > > Julien Cigar
> > > Belgian Biodiversity Platform
> > > http://www.biodiversity.be
> > > Université Libre de Bruxelles (ULB)
> > > Campus de la Plaine CP 257
> > > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > > Boulevard du Triomphe, entrée ULB 2
> > > B-1050 Bruxelles
> > > Mail: jcigar(at)ulb(dot)ac(dot)be
> > > @biobel: http://biobel.biodiversity.be/person/show/471
> > > Tel : 02 650 57 52
> > >
> > >
> > > --
> > > Sent via pgsql-sql mailing list (pgsql-sql(at)postgresql(dot)org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-sql
> > >
> > --
> > Julien Cigar
> > Belgian Biodiversity Platform
> > http://www.biodiversity.be
> > Université Libre de Bruxelles (ULB)
> > Campus de la Plaine CP 257
> > Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> > Boulevard du Triomphe, entrée ULB 2
> > B-1050 Bruxelles
> > Mail: jcigar(at)ulb(dot)ac(dot)be
> > @biobel: http://biobel.biodiversity.be/person/show/471
> > Tel : 02 650 57 52
> >
> >
> > --
> > Sent via pgsql-sql mailing list (pgsql-sql(at)postgresql(dot)org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-sql
> >
> --
> Julien Cigar
> Belgian Biodiversity Platform
> http://www.biodiversity.be
> Université Libre de Bruxelles (ULB)
> Campus de la Plaine CP 257
> Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
> Boulevard du Triomphe, entrée ULB 2
> B-1050 Bruxelles
> Mail: jcigar(at)ulb(dot)ac(dot)be
> @biobel: http://biobel.biodiversity.be/person/show/471
> Tel : 02 650 57 52
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
--
Julien Cigar
Belgian Biodiversity Platform
http://www.biodiversity.be
Université Libre de Bruxelles (ULB)
Campus de la Plaine CP 257
Bâtiment NO, Bureau 4 N4 115C (Niveau 4)
Boulevard du Triomphe, entrée ULB 2
B-1050 Bruxelles
Mail: jcigar(at)ulb(dot)ac(dot)be
@biobel: http://biobel.biodiversity.be/person/show/471
Tel : 02 650 57 52

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Aarni Ruuhimäki 2008-03-13 18:25:27 Counting days ...
Previous Message Julien 2008-03-13 16:27:43 Re: in() VS exists()