Re: cost and actual time

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-18 17:31:48
Message-ID: ajo45v4o0op0f3qnqh1le2rs79008fqp5u@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, 18 Feb 2003 11:28:40 +0100, Chantal Ackermann
<chantal(dot)ackermann(at)biomax(dot)de> wrote:
>1. the old query, leaving out the table gene and setting
>gene_occurrences.gene_id to a certain gene_id, or the function
>get_gene_id, respectively. (This is the query you suggested, Manfred.)

This was Tom's suggestion. I might have ended up there in a day or
two :-)

>What takes up most of the runtime the Nested Loop (for the join of
>disease and disease_occurrences, or rather for joining both occurrences
>tables? I'm not sure which rows belong together in the explain output).

... for joining both occurrences: The "-> Nested Loop" takes two
tables (the "-> Index Scans") as input and produces one table as
output which is again used as input for the "-> Hash Join" above it.

>2. The new query, same changes (gene left out, subselect replaced with
>get_gene_id):
>
>EXPLAIN ANALYZE
> SELECT disease.disease_name, count(disease.disease_name) AS cnt
> FROM
> ((SELECT gene_occurrences.sentence_id
> FROM gene_occurrences
> WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
> JOIN disease_occurrences USING (sentence_id)) as tmp2
> NATURAL JOIN disease
>GROUP BY disease.disease_name
>ORDER BY cnt DESC;

There is no DISTINCT here. This is equvalent to your first query, iff
the following unique constraints are true:
(gene_id, sentence_id) in gene_occurrences
(disease_id, sentence_id) in disease_occurrences
(disease_id) in disease

If they are, you don't need a sub-select (unless I'm missing
something, please double-check):

EXPLAIN ANALYZE
SELECT disease.disease_name, count(*) AS cnt
FROM disease, gene_occurrences, disease_occurrences
WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
AND gene_occurrences.gene_id=get_gene_id('igm')
AND disease.disease_id=disease_occurrences.disease_id
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

Anyway, your problem boils down to

EXPLAIN ANALYZE
SELECT d.disease_id, d.sentence_id
FROM gene_occurrences g, disease_occurrences d
WHERE g.sentence_id = d.sentence_id
AND g.gene_id = 'some constant value';

Play with enable_xxxx to find out which join method provides the best
performance for various gene_ids. Then we can start to fiddle with
run-time parameters to help the optimizer choose the right plan.

>Most of the runtime is used up by the index scan to join the occurrences
>tables [...]
>
>At the moment my settings concering the query planner are:
>
>effective_cache_size = 80000 # typically 8KB each, default 1000
>random_page_cost = 1.5 # units are one sequential page fetch cost

Usually you set a low random_page_cost value (the default is 4) if you
want to favour index scans where the optimizer tends to use sequential
scans. Was this your intention?

>cpu_tuple_cost = 0.01 # (same), default 0.01
>cpu_index_tuple_cost = 0.00001 # (same), default 0.001
>cpu_operator_cost = 0.005 # (same), default 0.0025

Just out of curiosity: Are these settings based on prior experience?

Servus
Manfred

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Josh Berkus 2003-02-18 17:37:20 Re: Questions about indexes?
Previous Message Chantal Ackermann 2003-02-18 10:28:40 Re: cost and actual time