Quick Links

Re: [HACKERS] Big IN() clauses etc : feature proposal

From:	PFC <lists(at)peufeu(dot)com>
To:	"Greg Stark" <gsstark(at)mit(dot)edu>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-performance(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Big IN() clauses etc : feature proposal
Date:	2006-05-09 16:29:31
Message-ID:	op.s9alfhjrcigqcu@apollo13
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-performance

> Creating cursors for a simple plan like a single sequential scan is fast
> because it's using the original data from the table.

I used the following query :

SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20

It's a backward index scan + limit... not a seq scan. And it's damn fast :

0.042 ms BEGIN
0.153 ms DECLARE tmp SCROLL CURSOR WITHOUT HOLD FOR SELECT * FROM
bookmarks ORDER BY annonce_id DESC LIMIT 20
0.246 ms FETCH ALL FROM tmp
0.048 ms MOVE FIRST IN tmp
0.246 ms FETCH ALL FROM tmp
0.048 ms CLOSE tmp
0.084 ms COMMIT

> But your example was
> predicated on this part of the job being a complex query. If it's a
> complex
> query involving joins and groupings, etc, then it will have to be
> materialized
> and there's no (good) reason for that to be any faster than a temporary
> table
> which is effectively the same thing.

You mean the cursors'storage is in fact the same internal machinery as a
temporary table ?

In that case, this raises an interesting question : why is the cursor
faster ?

Let's try a real-life example from my website : it is a search query
(quite complex) which is then joined to a lot of tables to resolve FKeys.
To that query I must add add an application-made join using a big IN()
clause extracted from the data.
Timings includes the time to fetch the results into Python.
The "running total" column is the sum of all timings since the BEGIN.

query_time running_total rows query
0.061 ms 0.061 ms -1 BEGIN
23.420 ms 23.481 ms 85 SELECT * FROM (huge query with a
lot of joins)
4.318 ms 27.799 ms 2 SELECT l.*, u.login, u.bg_color
FROM annonces_log l, users u WHERE u.id=l.user_id AND l.annonce_id IN
(list of ids from previous query) ORDER BY annonce_id, added
0.241 ms 28.040 ms -1 COMMIT

(Just in case you want to hurt yourself, here's the EXPLAIN ANALYZE
output : http://peufeu.com/temp/big_explain.txt)
Using a cursor takes about the same time.

Also, doing just the search query takes about 12 ms, the joins take up
the rest.

Now, I'll rewrite my query eliminating the joins and using a temp table.
Storing the whole result in the temp table will be too slow, because
there are too many columns.
Therefore I will only store the primary and foreign key columns, and join
again to the main table to get the full records.

query_time running_total rows query
0.141 ms 0.141 ms -1 BEGIN

Do the search :

8.229 ms 8.370 ms -1 CREATE TEMPORARY TABLE tmp AS
SELECT id, city_id, zipcode, contact_id, contact_group_id, price/terrain
as sort FROM (stripped down search query)
0.918 ms 9.287 ms -1 ANALYZE tmp

Fetch the main data to display :

7.663 ms 16.951 ms 85 SELECT a.* FROM tmp t,
annonces_display a WHERE a.id=t.id ORDER BY t.sort

Fetch log entries associates with each row (one row to many log entries) :

1.021 ms 17.972 ms 2 SELECT l.*, u.login, u.bg_color
FROM annonces_log l, users u, tmp t WHERE u.id=l.user_id AND l.annonce_id
= t.id ORDER BY annonce_id, added
3.468 ms 21.440 ms 216 SELECT annonce_id,
array_accum(list_id) AS list_ids, array_accum(COALESCE(user_id,0)) AS
list_added_by, max(added) AS added_to_list FROM bookmarks GROUP BY
annonce_id

Resolve foreign key relations

1.034 ms 22.474 ms 37 SELECT r.annonce_id FROM
read_annonces r, tmp t WHERE r.annonce_id = t.id
0.592 ms 23.066 ms 9 SELECT * FROM cities_dist_zipcode
WHERE zipcode IN (SELECT zipcode FROM tmp)
0.716 ms 23.782 ms 11 SELECT * FROM cities_dist WHERE id
IN (SELECT city_id FROM tmp)
1.125 ms 24.907 ms 45 SELECT * FROM contacts WHERE id IN
(SELECT contact_id FROM tmp)
0.799 ms 25.705 ms 42 SELECT * FROM contact_groups WHERE
id IN (SELECT contact_group_id FROM tmp)
0.463 ms 26.169 ms -1 DROP TABLE tmp
32.208 ms 58.377 ms -1 COMMIT

From this we see :

Using a temporary table is FASTER than doing the large query with all the
joins. (26 ms versus 28 ms).
It's also nicer and cleaner.
However the COMMIT takes as much time as all the queries together !

Let's run with fsync=off :

query_time running_total rows query
0.109 ms 0.109 ms -1 BEGIN
8.321 ms 8.430 ms -1 CREATE TEMPORARY TABLE tmp AS
SELECT id, city_id, zipcode, contact_id, contact_group_id, price/terrain
as sort FROM (stripped down search query)
0.849 ms 9.280 ms -1 ANALYZE tmp
7.360 ms 16.640 ms 85 SELECT a.* FROM tmp t,
annonces_display a WHERE a.id=t.id ORDER BY t.sort
1.067 ms 17.707 ms 2 SELECT l.*, u.login, u.bg_color
FROM annonces_log l, users u, tmp t WHERE u.id=l.user_id AND l.annonce_id
= t.id ORDER BY annonce_id, added
3.322 ms 21.030 ms 216 SELECT annonce_id,
array_accum(list_id) AS list_ids, array_accum(COALESCE(user_id,0)) AS
list_added_by, max(added) AS added_to_list FROM bookmarks GROUP BY
annonce_id
0.896 ms 21.926 ms 37 SELECT r.annonce_id FROM
read_annonces r, tmp t WHERE r.annonce_id = t.id
0.573 ms 22.499 ms 9 SELECT * FROM cities_dist_zipcode
WHERE zipcode IN (SELECT zipcode FROM tmp)
0.678 ms 23.177 ms 11 SELECT * FROM cities_dist WHERE id
IN (SELECT city_id FROM tmp)
1.064 ms 24.240 ms 45 SELECT * FROM contacts WHERE id IN
(SELECT contact_id FROM tmp)
0.772 ms 25.013 ms 42 SELECT * FROM contact_groups WHERE
id IN (SELECT contact_group_id FROM tmp)
0.473 ms 25.485 ms -1 DROP TABLE tmp
1.777 ms 27.262 ms -1 COMMIT

There, it's good again.

So, when fsync=on, and temporary tables are used, something slow happens
on commit (even if the temp table is ON COMMIT DROP...)
Thoughts ?

In response to

Re: [HACKERS] Big IN() clauses etc : feature proposal at 2006-05-09 15:00:29 from Greg Stark

Responses

Re: [HACKERS] Big IN() clauses etc : feature proposal at 2006-05-10 19:24:01 from Jim C. Nasby

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Bruce Momjian	2006-05-09 16:31:33	Re: Number of dimensions of an array parameter
Previous Message	Mitchell Skinner	2006-05-09 15:58:05	Re: [HACKERS] Big IN() clauses etc : feature proposal

Browse pgsql-performance by date

	From	Date	Subject
Next Message	PFC	2006-05-09 16:38:51	Re: [HACKERS] Big IN() clauses etc : feature proposal
Previous Message	Mitchell Skinner	2006-05-09 15:58:05	Re: [HACKERS] Big IN() clauses etc : feature proposal