
From: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-18 10:28:40
Message-ID: 3E520AD8.2000304@biomax.de
Lists: pgsql-performance
hello Manfred, hello Josh, hello Tom,

I followed your advice. Using the new query with explicit joins, 
combined with the function that retrieves the gene_id, the estimated row 
count is now far more realistic. Still, the same query run with 
different gene names sometimes takes less than a second, sometimes 
several minutes, apparently depending on whether the data is cached. In 
the resulting query plan, a Seq Scan, a Nested Loop and a Hash Join 
still account for most of the cost.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In detail:

I have created the following function:
CREATE OR REPLACE FUNCTION get_gene_id(TEXT) RETURNS INT AS
         'SELECT gene_id FROM gene WHERE gene_name = $1'
LANGUAGE SQL
IMMUTABLE STRICT;
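
Since the function is declared IMMUTABLE, the planner can evaluate it 
once and treat the result as a constant. A quick sanity check (using one 
of the gene names from the queries below) would be something like:

   -- illustrative check only; in the first plan below this
   -- lookup resolves to the literal gene_id = 70764
   SELECT get_gene_id('igm');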

Then I ran some queries with explain/explain analyze. For example:

1. The old query, with the table gene left out and 
gene_occurrences.gene_id compared either to a literal gene_id or to the 
function get_gene_id, respectively. (This is the query you suggested, 
Manfred.)

EXPLAIN ANALYZE
   SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
     FROM
       (SELECT DISTINCT disease.disease_name,
                        disease_occurrences.sentence_id
        FROM disease, gene_occurrences, disease_occurrences
        WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
          AND gene_occurrences.gene_id=get_gene_id('igm')
          AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name ORDER BY cnt DESC;

'igm' occurs about 54,000 times in gene_occurrences.
 
                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=308065.26..308069.67 rows=882 width=57) (actual time=53107.46..53108.13 rows=2326 loops=1)
    Sort Key: count(disease_name)
    ->  Aggregate  (cost=307846.61..307978.94 rows=882 width=57) (actual time=53011.97..53100.58 rows=2326 loops=1)
          ->  Group  (cost=307846.61..307934.83 rows=8822 width=57) (actual time=53011.94..53079.74 rows=16711 loops=1)
                ->  Sort  (cost=307846.61..307890.72 rows=8822 width=57) (actual time=53011.93..53020.32 rows=16711 loops=1)
                      Sort Key: disease_name
                      ->  Subquery Scan tmp  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.87..52958.72 rows=16711 loops=1)
                            ->  Unique  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.85..52915.20 rows=16711 loops=1)
                                  ->  Sort  (cost=305367.08..305808.17 rows=88218 width=57) (actual time=52877.85..52886.47 rows=16711 loops=1)
                                        Sort Key: disease.disease_name, disease_occurrences.sentence_id
                                        ->  Hash Join  (cost=4610.17..290873.90 rows=88218 width=57) (actual time=388.53..52752.92 rows=16711 loops=1)
                                              Hash Cond: ("outer".disease_id = "inner".disease_id)
                                              ->  Nested Loop  (cost=0.00..282735.01 rows=88218 width=20) (actual time=0.25..52184.26 rows=16711 loops=1)
                                                    ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..57778.26 rows=54692 width=8) (actual time=0.07..17455.52 rows=54677 loops=1)
                                                          Index Cond: (gene_id = 70764)
                                                    ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=0.63..0.63 rows=0 loops=54677)
                                                          Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                                              ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=387.45..387.45 rows=0 loops=1)
                                                    ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..207.71 rows=129923 loops=1)
  Total runtime: 53118.81 msec
(20 rows)

Is it the Nested Loop that takes up most of the runtime? And does it 
belong to the join of disease with disease_occurrences, or rather to the 
join of the two occurrences tables? (I'm not sure which lines belong 
together in the explain output.)

The cost for 'igg' is higher:
estimated total cost from explain: 584729.76;
actual runtime: 693210.99 msec.
The query plan is the same, and again the Nested Loop takes up most of 
the runtime:
(->  Nested Loop  (cost=0.00..538119.44 rows=176211 width=20) (actual time=0.28..691474.74 rows=30513 loops=1))



2. The new query, same changes (gene left out, subselect replaced with 
get_gene_id):

EXPLAIN ANALYZE
   SELECT disease.disease_name, count(disease.disease_name) AS cnt
   FROM
     ((SELECT gene_occurrences.sentence_id
         FROM gene_occurrences
           WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
       JOIN disease_occurrences USING (sentence_id)) as tmp2
     NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

'csf' occurs about 55,000 times in gene_occurrences.
 
              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=323834.95..323881.54 rows=9318 width=57) (actual time=146975.89..146976.60 rows=2383 loops=1)
    Sort Key: count(disease.disease_name)
    ->  Aggregate  (cost=321208.63..322606.31 rows=9318 width=57) (actual time=146840.89..146968.58 rows=2383 loops=1)
          ->  Group  (cost=321208.63..322140.42 rows=93179 width=57) (actual time=146840.87..146941.60 rows=24059 loops=1)
                ->  Sort  (cost=321208.63..321674.53 rows=93179 width=57) (actual time=146840.85..146852.92 rows=24059 loops=1)
                      Sort Key: disease.disease_name
                      ->  Hash Join  (cost=4485.78..305826.96 rows=93179 width=57) (actual time=544.79..146651.05 rows=24059 loops=1)
                            Hash Cond: ("outer".disease_id = "inner".disease_id)
                            ->  Nested Loop  (cost=0.00..297614.04 rows=93179 width=20) (actual time=105.85..145936.47 rows=24059 loops=1)
                                  ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8) (actual time=41.86..47116.68 rows=55752 loops=1)
                                        Index Cond: (gene_id = 29877)
                                  ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=1.76..1.77 rows=0 loops=55752)
                                        Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                            ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=438.16..438.16 rows=0 loops=1)
                                  ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..236.78 rows=129923 loops=1)
  Total runtime: 146986.12 msec
(16 rows)

This query is obviously not as good as the old one, though I don't 
understand why the explicit joins are worse than what the optimizer 
chooses. The Nested Loop still takes up the biggest part.

When I set enable_nestloop to false, explain outputs this plan:
 
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=2146887.90..2146934.49 rows=9318 width=57)
    Sort Key: count(disease.disease_name)
    ->  Aggregate  (cost=2144261.59..2145659.27 rows=9318 width=57)
          ->  Group  (cost=2144261.59..2145193.37 rows=93179 width=57)
                ->  Sort  (cost=2144261.59..2144727.48 rows=93179 width=57)
                      Sort Key: disease.disease_name
                      ->  Merge Join  (cost=2122513.18..2128879.92 rows=93179 width=57)
                            Merge Cond: ("outer".disease_id = "inner".disease_id)
                            ->  Index Scan using disease_pkey on disease  (cost=0.00..3388.03 rows=129923 width=37)
                            ->  Sort  (cost=2122513.18..2122979.08 rows=93179 width=20)
                                  Sort Key: disease_occurrences.disease_id
                                  ->  Merge Join  (cost=69145.63..2107131.52 rows=93179 width=20)
                                        Merge Cond: ("outer".sentence_id = "inner".sentence_id)
                                        ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..1960817.45 rows=15079045 width=12)
                                        ->  Sort  (cost=69145.63..69434.47 rows=57768 width=8)
                                              Sort Key: gene_occurrences.sentence_id
                                              ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8)
                                                    Index Cond: (gene_id = 29877)
(18 rows)

Most of the runtime is used up by the index scan that joins the 
occurrences tables, while the index scan for joining disease and 
disease_occurrences is fast.
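
For completeness, the way I turned nested loops off was a per-session 
setting only, roughly:

   SET enable_nestloop = false;  -- affects only the current session
   EXPLAIN
      SELECT ...;                -- the same query as in 2. above
   SET enable_nestloop = true;   -- restore the default afterwards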




At the moment my settings concerning the query planner are:

effective_cache_size = 80000    # typically 8KB each, default 1000
random_page_cost = 1.5          # units are one sequential page fetch cost
cpu_tuple_cost = 0.01           # (same), default 0.01
cpu_index_tuple_cost = 0.00001  # (same), default 0.001
cpu_operator_cost = 0.005       # (same), default 0.0025
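
(These live in postgresql.conf, but for experimenting they can also be 
overridden per session, e.g.:

   SET random_page_cost = 1.5;   -- session-local override
   SHOW random_page_cost;        -- check the active value

)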



I am still having trouble reading the output of explain, especially 
knowing which lines belong together and which join strategy applies to 
which join. Is there any documentation that describes the format of 
explain, other than the section in the main manual that comes with the 
postgres installation? Just some short explanation or example of how to 
interpret the indentation and arrows.
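
From what I have pieced together so far (please correct me if this is 
wrong): the indentation encodes a tree, and each '->' line is an input 
of the nearest less-indented node above it, so a join node's two inputs 
are the two '->' children directly beneath it. Schematically:

   Hash Join                -- combines its two children below
     ->  Nested Loop        -- outer input of the hash join
           ->  Index Scan   -- outer input of the nested loop
           ->  Index Scan   -- inner input, run once per outer row (loops=N)
     ->  Hash               -- inner input: the table that gets hashed
           ->  Seq Scan     -- feeds the hash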

Thank you for your help!
Chantal

