TimeOf(Subselects|Joins)FromLargeTables?

From: "Hegedus, Tamas (dot)" <Hegedus(dot)Tamas(at)mayo(dot)edu>
To: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: TimeOf(Subselects|Joins)FromLargeTables?
Date: 2004-06-05 00:17:50
Message-ID: EDA50E1FFC76D5119A1A0002B30A347406F9D766@excsrv28.mayo.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Dear All,

(I asked the general-list, I had some advice, but the final conclusion was to ask the developers and tell them a possible mistake of the query planner, too.)

I am a biologist and I do not know what to expect from an RDB (PgSQL).
I have large tables: 1215607 rows in prots, 2184596 rows in kwx (see table details below). I would like to do something like that:

SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812);

After executing this (either as a subquery or joins) one of the best/fastest result what I had (SET enable_seqscan=off):
83643.482 ms (see EXPLAIN ANALYZE below).

The two (similar) parts of this query are executed much faster:
SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302ms, n(rows)=78050
SELECT name, seq FROM prots WHERE fid < 80000 -- takes 1969.231 ms

Is this realistic? OK?
If not: how can I increase the speed by fine tuning of the RDB (indexes, run-time parameters) or my SQL query?
(It came now into my mind: if I decrease the number of columns in the prots table (to have only 3 fields (fid, name, seq) instead of 20 columns), than the prots table will have smaller file size on disk, than this table may need less disk page fetches, queries may be faster. Is this true?)

---------------------------------------------
Dann[GENERAL] suggested to use hash-index on prot.fid and kwx.fid.
However, you suggested not to use hash index ("Note: Testing has shown PostgreSQL's hash indexes to perform no better than B-tree indexes, and the index size and build time for hash indexes is much worse. For these reasons, hash index use is presently discouraged."), the query speed increased: 36134.105 ms

According to Dann the planner should have used the hash-indexes instead of b-trees. But I had to remove the b-trees to force the planner to use the hash-indexes. Is this normal, or is there a softer way to force hash-index using of the planner (coexisting hash and b-tree type indexes for the same column).

---------------------------------------------
What should I expect from the following (similar with joins) queries?
How should I optimize the indexes? Which one to use?

SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812 AND kw_acc=215);
SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx WHERE kw_acc=812 OR kw_acc=215);

---------------------------------------------
When I executed the same explain analyze select etc., the speed decreased to 5s. This means the statistics of the planner is pretty good. In order to test a real situation (when I do not have statistic), how can I clear this statistics?

Thanks for your help!
Tamas

===============================================
Table "public.prots"
Column | Type | Modifiers
-----------+----------------------+----------
fid | integer | not null
name | character varying(10) | not null
[...other 17 columns...]
seq | text |
Indexes:
"prots_pkey" primary key, btree (fid)
"ix_prots_acc" unique, btree (acc)
"ix_prots_name" unique, btree (name)
"ix_prots_class" btree ("class")
===============================================
Table "public.kwx"
Column | Type | Modifiers
--------+--------+----------
fid | integer |
kw_acc | integer |
Indexes:
"ix_kwx_acc" btree (kw_acc)
"ix_kwx_fid" btree (fid)
Foreign-key constraints:
"fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
"fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
===============================================

EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 812;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=0.00..160429.66 rows=84473 width=349) (actual time=29.039..83505.629 rows=78050 loops=1)
Merge Cond: ("outer".fid = "inner".fid)
-> Index Scan using ix_kwx_fid on kwx (cost=0.00..44987.55 rows=84473 width=4) (actual time=18.893..5730.468 rows=78050 loops=1)
Filter: (kw_acc = 812)
-> Index Scan using prots_pkey on prots (cost=0.00..112005.24 rows=981127 width=353) (actual time=0.083..76059.235 rows=1210377 loops=1)
Total runtime: 83643.482 ms
(6 rows)

===================================================================
--with hash-indexes
===================================================================
explain ANALYZE select p.name, p.seq from prots p, kwx k where p.fid=k.fid AND k.kw_acc=812;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..662253.63 rows=69180 width=344) (actual time=43.337..36076.045 rows=78050 loops=1)
-> Index Scan using ix_kwx_acc on kwx k (cost=0.00..245382.38 rows=69179 width=4) (actual time=0.109..422.159 rows=78050 loops=1)
Index Cond: (kw_acc = 812)
-> Index Scan using prt_fid_ix on prots p (cost=0.00..6.01 rows=1 width=348) (actual time=0.414..0.450 rows=1 loops=78050)
Index Cond: (p.fid = "outer".fid)
Total runtime: 36134.105 ms

====================================================================
EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 812;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..560203.74 rows=86657 width=343) (actual time=0.139..13924.698 rows=78050 loops=1)
-> Seq Scan on kwx (cost=0.00..38016.45 rows=86656 width=4) (actual time=0.065..3363.545 rows=78050 loops=1)
Filter: (kw_acc = 812)
-> Index Scan using prt_fid_ix on prots (cost=0.00..6.01 rows=1 width=347) (actual time=0.111..0.129 rows=1 loops=78050)
Index Cond: (prots.fid = "outer".fid)
Total runtime: 13981.767 ms

=====================================================================
EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on (prots.fid=kwx.fid) where kwx.kw_acc = 813;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..288142.81 rows=41509 width=343) (actual time=0.974..8645.423 rows=42388 loops=1)
-> Seq Scan on kwx (cost=0.00..38016.45 rows=41508 width=4) (actual time=0.926..1161.183 rows=42388 loops=1)
Filter: (kw_acc = 813)
-> Index Scan using prt_fid_ix on prots (cost=0.00..6.01 rows=1 width=347) (actual time=0.151..0.170 rows=1 loops=42388)
Index Cond: (prots.fid = "outer".fid)
Total runtime: 8676.116 ms
(6 rows)

Browse pgsql-hackers by date

  From Date Subject
Next Message Dann Corbit 2004-06-05 00:42:31 Re: TimeOf(Subselects|Joins)FromLargeTables?
Previous Message Jan Wieck 2004-06-04 21:25:01 Re: [GENERAL] [HACKERS] Slony-I goes BETA