Re: people who buy A, also buy C, D, E

From: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
To: Dan Langille <dan(at)langille(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 13:27:44
Message-ID: 426E41D0.D4103DEF@rodos.fzk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Dan Langille wrote:
>
> On 26 Apr 2005 at 14:24, Christoph Haller wrote:
>
> > Dan Langille wrote:
> > >
> > > The goal of my query is: given a book, what did other people who
> > > bought this book also buy? I plan the list the 5 most popular such
> > > books. In reality, this isn't about books, but that makes it easier
> > > to understand I think.
> > >
> > > We have a table of customer_id (watch_list_id) and book_id
> > > (element_id).
> > >
> > > freshports.org=# \d watch_list_element
> > > Table "public.watch_list_element"
> > > Column | Type | Modifiers
> > > ---------------+---------+-----------
> > > watch_list_id | integer | not null
> > > element_id | integer | not null
> > > Indexes:
> > > "watch_list_element_pkey" primary key, btree (watch_list_id,
> > > element_id)
> > > "watch_list_element_element_id" btree (element_id)
> > > Foreign-key constraints:
> > > "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > > UPDATE CASCADE ON DELETE CASCADE
> > > "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > > CASCADE ON DELETE CASCADE
> > >
> > > freshports.org=#
> > >
> > > I have a query which returns the needed results:
> > >
> > > SELECT W.element_id
> > > FROM watch_list_element W
> > > WHERE w.watch_list_id in (select watch_list_id from
> > > watch_list_element where element_id = 54968)
> > > GROUP BY W.element_id
> > > ORDER BY count(W.watch_list_id) DESC
> > > LIMIT 5;
> > >
> > > But performance is an issue here. So I'm planning to calculate all
> > > the possible values and cache them. That is, given each element_id
> > > in a watch_list, what are the top 5 element_id values on all the
> > > lists on which the original element_id appears?
> > >
> > > I'm having trouble constructing the query. I'm not even sure I can
> > > do this in one select, but that would be nice. Examples and clues
> > > are appreciated.
> > >
> > > Any ideas?
> > >
> > > Thank you.
> > > --
> >
> > Just two ideas.
> >
> > 1) Older Postgres versions are notorious for being slow
> > on "IN" clauses.
> > Does this one (untested) perform better:
> >
> > SELECT W.element_id, count(W.watch_list_id)
> > FROM watch_list_element W
> > WHERE EXISTS
> > (SELECT * FROM watch_list_element E
> > WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> > GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;
>
> I'm on PostgreSQL 7.4.7:
>
> freshports.org=# explain analyse
> freshports.org-# SELECT W.element_id, count(W.watch_list_id)
> freshports.org-# FROM watch_list_element W
> freshports.org-# WHERE EXISTS
> freshports.org-# (SELECT * FROM watch_list_element E
> freshports.org(# WHERE E.element_id = 54968 AND W.watch_list_id =
> E.watch_list_id)
> freshports.org-# GROUP BY W.element_id
> freshports.org-# ORDER BY 2 DESC
> freshports.org-# LIMIT 5;
>
> QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> ---------------------------------
> Limit (cost=417905.49..417905.51 rows=5 width=8) (actual
> time=3142.480..3142.528 rows=5 loops=1)
> -> Sort (cost=417905.49..417908.08 rows=1033 width=8) (actual
> time=3142.471..3142.486 rows=5 loops=1)
> Sort Key: count(watch_list_id)
> -> HashAggregate (cost=417851.20..417853.78 rows=1033
> width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
> -> Seq Scan on watch_list_element w
> (cost=0.00..417506.76 rows=68888 width=8) (actual
> time=0.129..2619.989 rows=94018 loops=1)
> Filter: (subplan)
> SubPlan
> -> Index Scan using watch_list_element_pkey
> on watch_list_element e (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.011..0.011 rows=1 loops=137776)
> Index Cond: (($0 = watch_list_id) AND
> (element_id = 54968))
> Total runtime: 3143.304 ms
> (10 rows)
>
> freshports.org=#
>
> Compare that to the original query:
>
> freshports.org=# explain analyse
> freshports.org-# SELECT W.element_id
> freshports.org-# FROM watch_list_element W
> freshports.org-# WHERE w.watch_list_id in (select watch_list_id
> from
> freshports.org(# watch_list_element where element_id = 54968)
> freshports.org-# GROUP BY W.element_id
> freshports.org-# ORDER BY count(W.watch_list_id) DESC
> freshports.org-# LIMIT 5;
>
> QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> --------------------------------------------
> Limit (cost=1174.26..1174.28 rows=5 width=8) (actual
> time=1786.628..1786.675 rows=5 loops=1)
> -> Sort (cost=1174.26..1176.16 rows=758 width=8) (actual
> time=1786.618..1786.634 rows=5 loops=1)
> Sort Key: count(w.watch_list_id)
> -> HashAggregate (cost=1136.11..1138.01 rows=758 width=8)
> (actual time=1718.439..1756.290 rows=7338 loops=1)
> -> Nested Loop (cost=465.98..1132.32 rows=758
> width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
> -> HashAggregate (cost=465.98..465.98 rows=6
> width=4) (actual time=44.332..47.136 rows=599 loops=1)
> -> Index Scan using
> watch_list_element_element_id on watch_list_element
> (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789
> rows=599 loops=1)
> Index Cond: (element_id = 54968)
> -> Index Scan using watch_list_element_pkey on
> watch_list_element w (cost=0.00..109.48 rows=126 width=8) (actual
> time=0.049..0.863 rows=157 loops=599)
> Index Cond: (w.watch_list_id =
> "outer".watch_list_id)
> Total runtime: 1787.466 ms
> (11 rows)
>
> freshports.org=#
>
> > 2) I suspect calculating all possible values would require time and an
> > enormous cache buffer in size as well as re-calculating pretty often.
> > So my approach would be trying to tune the query before introducing
> > cached results.
>
> Yes, it's pretty extensive: for each element (there are 9000), get me
> all the watch lists it appears upon, then get me the top 5 elements
> on *those* watch lists.
> --

So your original query outruns my "EXISTS" version
by a factor of 2. I did not realize the "IN" has
become so much faster.
I am running out of clues.
May be the [PERFORMANCE] list has more
qualified approaches.
Regards, Christoph

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Rodrigo Carvalhaes 2005-04-26 16:07:18 Re: UPDATE WITH ORDER BY
Previous Message Dan Langille 2005-04-26 12:35:20 Re: people who buy A, also buy C, D, E