Sql performace - why soo long ?

From: "Lada 'Ray' Lostak" <ray(at)unreal64(dot)net>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Sql performace - why soo long ?
Date: 2004-12-10 14:27:50
Message-ID: 037801c4dec4$6debc220$0d01a8c0@utopia
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


Hi there :)

I don't understand why following '3' statements are SOO different. We
run latest PgSQL, OpenBSD 3.4, AMD XP 2000+ and plenty of RAM... CPU was
fully dedicated to PG. All SQL's do the same and produces same result.

Imagine 2 small tables:

A - ~5000 records (5.000 record is small number :)
B - ~1500 records

B have 'key' to A (let's call it 'key') pointing to 'id' of table A
A includes (also) 'size' column (let's call it 'rsize') - INT type

B.key -> A.id

A.id is normal 'sequence'
B.key have created index (btree)

There are more B tables - B1, B2,... (tenths). So, B means 'some' of these
tables.

I need to get total size of all elemnts belongs to B.

Maybe important - maybe not. Table A includes '2' columns which holds LOOONG
datas (in total hundreds of megabytes). It's why divided from B tables....
And also it is reason, why I don't need to include any other limitation in
following SQL's.

Simplest solution:
-------------------

SELECT sum(rsize) from A where id IN (SELECT key FROM b)

This takes ~10 seconds to execute (!)

Explain:

Aggregate (cost=64235.38..64235.38 rows=1 width=4)
-> Seq Scan on A (cost=0.00..64229.17 rows=2486 width=4)
Filter: (subplan)
SubPlan
-> Seq Scan on B (cost=0.00..25.57 rows=1457 width=0)
(5 rows)

To me 10 seconds sounds really bad - because both tables are small ones
(only 7.5 millions of cross records)... The CPU isn't sooo slow.

'IN' style (generated)
-----------------------
SELECT SUM(rsize) FROM a WHERE id IN (1,2,3,4,5,6,7,8,......) [there is
LOOOONG list]

This takes ~1500 ms to execute - it's ~7x faster than subselect, even there
is over 1500 'ORs'

Explain:

Aggregate (cost=11100.39..11100.39 rows=1 width=4)
Index Scan using B.id,B.id,B.id,B.id,B.id......
Index Cond: ((id = 560) OR (id = 561) OR (id = 562) OR (id = 563) OR (id =
741) OR ......)

The explain is shortcuted at .... place - both includes LONG list (repeating
the same basically).

'JOIN' style
------------
SELECT SUM(rsize) FROM a,b WHERE a.key=b.id

This takes ~46 ms to execute

Explain:

Aggregate (cost=725.67..725.67 rows=1 width=12)
Hash Join (cost=29.22..722.03 rows=1458 width=12)
Hash Cond: ("outer".id = "inner".key)
Seq Scan on A (cost=0.00..649.72 rows=4972 width=8)
Hash (cost=25.58..25.58 rows=1458 width=4)
Seq Scan on B (cost=0.00..25.58 rows=1458 width=4)

----------------------------

All SQL produces same result: 96708362 (which is correct anyway :)

Please, can anyone expalin me (or point good article about it), why there
SOO big difference in execution times ? I would like to more understand how
internally PgSql 'executes' SQL...

Ofcourse, the best solution is JOIN, but isn't point. There is 200x speed
difference...Using generated "ID in (1,2,3...)" with 1500 terms is still 6x
faster rather subselect.... I really can't get it :)

I understand to output (which corespond with real times) - but don't have
clue, why it's soo differnet. I expect results to be 'similar'. Small
tables, 'fast' CPU. Nothing complicated in SQL's.

Thank you,
Best regards,
Lada 'Ray' Lostak
Unreal64 Develop group
http://www.orcave.com
http://www.unreal64.net

--------------------------------------------------------------------------
In the 1960s you needed the power of two C64s to get a rocket
to the moon. Now you need a machine which is a vast number
of times more powerful just to run the most popular GUI.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message MaRCeLO PeReiRA 2004-12-10 15:59:20 test
Previous Message Tomas =?iso-8859-1?q?Sk=E4re?= 2004-12-10 11:40:50 Query is not using index when it should