
Re: cost and actual time

From: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>,pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-19 09:38:54
Lists: pgsql-performance
Hello Manfred,

> ... for joining both occurrences:  The "-> Nested Loop" takes two
> tables (the "-> Index Scans") as input and produces one table as
> output which is again used as input for the "-> Hash Join" above it.

As I am testing with the most frequent gene names (= the gene_ids that 
are the most frequent in the occurrences tables), this is a very 
expensive join. Whenever I try a less frequent gene_id the runtime is 
shorter (though I haven't tested specifically with less frequent gene_ids 
yet; my focus is on making the searches for the most frequent genes 
faster, as these are probably the ones that are searched for a lot).

> There is no DISTINCT here.  This is equivalent to your first query, iff
> the following unique constraints are true:
> 	(gene_id, sentence_id) in gene_occurrences
> 	(disease_id, sentence_id) in disease_occurrences
> 	(disease_id) in disease
> If they are, you don't need a sub-select (unless I'm missing
> something, please double-check):

Yes, I noticed the difference between the two queries. Actually, I am 
afraid of dropping the DISTINCT because I have had results with duplicate 
rows (though I shall recheck when that is really the case).  These are 
the table declarations and constraints:

relate=# \d gene
         Table "public.gene"
    Column    |  Type   | Modifiers
  gene_id     | integer | not null
  gene_name   | text    | not null
  gene_syn_id | integer | not null
Indexes: gene_pkey primary key btree (gene_id),
          gene_name_uni unique btree (gene_name),
          gene_uni unique btree (gene_name, gene_syn_id),
          gene_syn_idx btree (gene_syn_id)

(disease looks the same)

relate_01=# \d gene_occurrences
   Table "public.gene_occurrences"
    Column    |  Type   | Modifiers
  sentence_id | bigint  | not null
  gene_id     | integer | not null
  puid        | integer | not null
Indexes: gene_occ_uni unique btree (sentence_id, gene_id),
          gene_occ_id_i btree (gene_id)

relate_01=# \d disease_occurrences
Table "public.disease_occurrences"
    Column    |  Type   | Modifiers
  sentence_id | bigint  | not null
  disease_id  | integer | not null
  puid        | integer | not null
Indexes: disease_occ_uni unique btree (sentence_id, disease_id),
          disease_occ_id_i btree (disease_id)
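
Given the unique indexes gene_occ_uni and disease_occ_uni, one way to check whether the join without DISTINCT can ever yield duplicate rows would be something like the following (just a sketch; 16338 is an arbitrary gene_id taken from the plans in this thread):

```sql
-- Any row returned here is a (disease_id, sentence_id) pair that the
-- plain join emits more than once, i.e. one that DISTINCT would remove.
SELECT d.disease_id, d.sentence_id, count(*)
  FROM gene_occurrences g, disease_occurrences d
 WHERE g.sentence_id = d.sentence_id
   AND g.gene_id = 16338
 GROUP BY d.disease_id, d.sentence_id
HAVING count(*) > 1;
```

If this comes back empty for the gene_ids in question, the DISTINCT is indeed redundant for that query.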

sentence_id and gene_id/disease_id are in an n:m relation.
As sentence_id is the primary key of a table with more than 50 million 
rows, we decided not to use a serial as the primary key but a unique 
combination of two existing values. As this combination is too long for 
an ordinary int, we have to use bigint as the type. Is the join so 
expensive because of that?

We had a primary key occurrence_id on the occurrences tables, but we 
noticed that we don't use it, so we didn't recreate it in the new 
database. Is it possible that Postgres could have used it internally?

> Play with enable_xxxx to find out which join method provides the best
> performance for various gene_ids.  Then we can start to fiddle with
> run-time parameters to help the optimizer choose the right plan.

This would be VERY helpful! :-)

I played around and this is the result:

SELECT d.disease_id, d.sentence_id
    FROM gene_occurrences g, disease_occurrences d
   WHERE g.sentence_id = d.sentence_id
     AND g.gene_id = get_gene_id([different very frequent gene names]);

choice of the planner: Nested Loop
  Total runtime: 53508.86 msec

set enable_nestloop to false;
Merge Join:  Total runtime: 113066.81 msec

set enable_mergejoin to false;
Hash Join: Total runtime: 439344.44 msec
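
Each of those runs was just a session-level switch before re-running the query; roughly like this (a sketch, with the gene name elided as above):

```sql
-- Turn off the planner's preferred join method, re-run the query and
-- note the runtime, then restore the default for the next experiment.
SET enable_nestloop TO off;
EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
  FROM gene_occurrences g, disease_occurrences d
 WHERE g.sentence_id = d.sentence_id
   AND g.gene_id = get_gene_id('...');
RESET enable_nestloop;
```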

Disabling the hash join results again in a Nested Loop with very high 
cost but low runtime - I'm not sure if the latter is a consequence of 
caching. I changed the gene name at every run to avoid caching.

So the Nested Loop is obviously the best way to go?

For comparison: a less frequent gene (occurring 6717 times) produces 
the following query plan:

Nested Loop  (cost=0.00..41658.69 rows=12119 width=20) (actual 
time=87.01..19076.62 rows=1371 loops=1)
    ->  Index Scan using gene_occ_id_i on gene_occurrences g 
(cost=0.00..10754.08 rows=7514 width=8) (actual time=35.89..10149.14 
rows=6717 loops=1)
          Index Cond: (gene_id = 16338)
    ->  Index Scan using disease_occ_uni on disease_occurrences d 
(cost=0.00..4.09 rows=2 width=12) (actual time=1.32..1.32 rows=0 loops=6717)
          Index Cond: ("outer".sentence_id = d.sentence_id)
  Total runtime: 19078.48 msec

> Usually you set a low random_page_cost value (the default is 4) if you
> want to favour index scans where the optimizer tends to use sequential
> scans.  Was this your intention?

No, not really. I found a posting in the archives that suggested 
reducing this parameter, so I tried it. I don't think it had any 
perceptible effect.

>>cpu_tuple_cost = 0.01           # (same), default 0.01
>>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>>cpu_operator_cost = 0.005       # (same), default 0.0025
> Just out of curiosity:  Are these settings based on prior experience?

Nope. Same as above. I changed these variables only two days ago, as 
far as I recall. Until then I had them at their defaults.
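
For reference, those planner cost parameters with the defaults quoted above would look like this in postgresql.conf (a config sketch, not a recommendation):

```
random_page_cost = 4            # default; lower values favour index scans
cpu_tuple_cost = 0.01           # default
cpu_index_tuple_cost = 0.001    # default
cpu_operator_cost = 0.0025      # default
```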

