Skip site navigation (1) Skip section navigation (2)

Re: cost and actual time

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>,pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-18 17:31:48
Message-ID: (view raw, whole thread or download thread mbox)
Lists: pgsql-performance
On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann
<chantal(dot)ackermann(at)biomax(dot)de> wrote:
>1. the old query, leaving out the table gene and setting 
>gene_occurrences.gene_id to a certain gene_id, or the function 
>get_gene_id, respectively. (This is the query you suggested, Manfred.)

This was Tom's suggestion.  I might have ended up there in a day or
two :-)

>What takes up most of the runtime the Nested Loop (for the join of 
>disease and disease_occurrences, or rather for joining both occurrences 
>tables? I'm not sure which rows belong together in the explain output).

... for joining both occurrences:  The "-> Nested Loop" takes two
tables (the "-> Index Scans") as input and produces one table as
output which is again used as input for the "-> Hash Join" above it.

>2. The new query, same changes (gene left out, subselect replaced with 
>   SELECT disease.disease_name, count(disease.disease_name) AS cnt
>   FROM
>     ((SELECT gene_occurrences.sentence_id
>         FROM gene_occurrences
>           WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
>       JOIN disease_occurrences USING (sentence_id)) as tmp2
>     NATURAL JOIN disease
>GROUP BY disease.disease_name

There is no DISTINCT here.  This is equvalent to your first query, iff
the following unique constraints are true:
	(gene_id, sentence_id) in gene_occurrences
	(disease_id, sentence_id) in disease_occurrences
	(disease_id) in disease

If they are, you don't need a sub-select (unless I'm missing
something, please double-check):

   SELECT disease.disease_name, count(*) AS cnt
     FROM disease, gene_occurrences, disease_occurrences
    WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
      AND gene_occurrences.gene_id=get_gene_id('igm')
      AND disease.disease_id=disease_occurrences.disease_id
    GROUP BY tmp.disease_name
    ORDER BY cnt DESC;

Anyway, your problem boils down to

	SELECT d.disease_id, d.sentence_id
	  FROM gene_occurrences g, disease_occurrences d
	 WHERE g.sentence_id = d.sentence_id
	   AND g.gene_id = 'some constant value';

Play with enable_xxxx to find out which join method provides the best
performance for various gene_ids.  Then we can start to fiddle with
run-time parameters to help the optimizer choose the right plan.

>Most of the runtime is used up by the index scan to join the occurrences 
>tables [...]
>At the moment my settings concering the query planner are:
>effective_cache_size = 80000    # typically 8KB each, default 1000
>random_page_cost = 1.5          # units are one sequential page fetch cost

Usually you set a low random_page_cost value (the default is 4) if you
want to favour index scans where the optimizer tends to use sequential
scans.  Was this your intention?

>cpu_tuple_cost = 0.01           # (same), default 0.01
>cpu_index_tuple_cost = 0.00001  # (same), default 0.001
>cpu_operator_cost = 0.005       # (same), default 0.0025

Just out of curiosity:  Are these settings based on prior experience?


In response to


pgsql-performance by date

Next:From: Josh BerkusDate: 2003-02-18 17:37:20
Subject: Re: Questions about indexes?
Previous:From: Chantal AckermannDate: 2003-02-18 10:28:40
Subject: Re: cost and actual time

Privacy Policy | About PostgreSQL
Copyright © 1996-2017 The PostgreSQL Global Development Group