Skip site navigation (1) Skip section navigation (2)

slow self-join query

From: Robert Poor <rdpoor(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: slow self-join query
Date: 2012-03-17 20:56:02
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
Disclaimer: this is a re-post, since I wasn't subscribed the first
time I posted.  Pardon if this is a duplicate.]

The following query is abysmally slow (e.g. 7 hours+).  The goal is to
find "among the users that follow user #1, who do they also follow?"
and to count the latter.

   SELECT L2.followed_id as followed_id, COUNT( AS count
     FROM users AS U1
INNER JOIN links AS L1 ON L1.follower_id =
INNER JOIN links AS L2 ON L2.follower_id =
    WHERE U1.type = 'User::Twitter'
      AND L1.followed_id = 1
 GROUP BY L2.followed_id

Here's the rest of the info.

=== versions

psql (9.1.2, server 8.3.14)

=== schema

 create_table "users", :force => true do |t|
   t.string   "type"

 create_table "links", :force => true do |t|
   t.integer  "followed_id"
   t.integer  "follower_id"

 add_index "links", ["follower_id"], :name => "index_links_on_follower_id"
 add_index "links", ["followed_id", "follower_id"], :name =>
"index_links_on_followed_id_and_follower_id", :unique => true
 add_index "links", ["followed_id"], :name => "index_links_on_followed_id"

=== # of rows

users: 2,525,539
links: 4,559,390

=== explain

 "HashAggregate  (cost=490089.52..490238.78 rows=11941 width=8)"
 "  ->  Hash Join  (cost=392604.44..483190.22 rows=1379860 width=8)"
 "        Hash Cond: (f1.follower_id ="
 "        ->  Bitmap Heap Scan on links f1  (cost=14589.95..55597.70
rows=764540 width=4)"
 "              Recheck Cond: (followed_id = 1)"
 "              ->  Bitmap Index Scan on index_links_on_followed_id
(cost=0.00..14398.82 rows=764540 width=0)"
 "                    Index Cond: (followed_id = 1)"
 "        ->  Hash  (cost=300976.98..300976.98 rows=4559881 width=12)"
 "              ->  Hash Join  (cost=94167.40..300976.98 rows=4559881 width=12)"
 "                    Hash Cond: (f2.follower_id ="
 "                    ->  Seq Scan on links f2  (cost=0.00..77049.81
rows=4559881 width=8)"
 "                    ->  Hash  (cost=53950.20..53950.20 rows=2526496 width=4)"
 "                          ->  Seq Scan on users u1
(cost=0.00..53950.20 rows=2526496 width=4)"
 "                                Filter: ((type)::text =

=== other comments

I'm assuming I'm doing something obviously stupid and that the above
info will be sufficient for anyone skilled in the art to detect the
problem.  However, if needed I will gladly invest the time
to create a subset of the data in order to run EXPLAIN ANALYZE.  (With
the whole dataset, it requires > 7 hours to complete the query.  I
don't want to go down that path again!)

- rdp


pgsql-performance by date

Next:From: Scott MarloweDate: 2012-03-18 00:57:08
Subject: Re: slow self-join query
Previous:From: Kevin GrittnerDate: 2012-03-17 16:16:09
Subject: Re: Shared memory for large PostGIS operations

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group