Re: cost and actual time

From: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
To: Manfred Koizar <mkoi-pg(at)aon(dot)at>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: cost and actual time
Date: 2003-02-18 10:28:40
Message-ID: 3E520AD8.2000304@biomax.de
Lists: pgsql-performance

hello Manfred, hello Josh, hello Tom,

I followed your advice. Using the new query with explicit joins, combined
with the function that retrieves the gene_id, the estimated row count is
now far more realistic. Still, the same query with different gene names
sometimes takes less than a second, sometimes several minutes, evidently
due to caching (or the lack of it). In the resulting query plan, a Seq
Scan, a Nested Loop and a Hash Join still account for most of the cost.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In detail:

I have created the following function:
CREATE OR REPLACE FUNCTION get_gene_id(TEXT) RETURNS INT AS
'SELECT gene_id FROM gene WHERE gene_name = $1'
LANGUAGE SQL
IMMUTABLE STRICT;
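(A side note on why this helps, as far as I understand it: strictly speaking a function that reads a table is not immutable, but declaring it IMMUTABLE lets the planner evaluate the call once at plan time and treat the result as a constant, which is why the plans below show a literal gene_id in the Index Cond rather than a function call. A quick check, with the id taken from the plan output below:)

```sql
-- The call is folded to a constant at plan time, so the index scan
-- in the plan shows e.g. "Index Cond: (gene_id = 70764)".
SELECT get_gene_id('igm');
```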

Then I ran some queries with explain/explain analyze. For example:

1. The old query, leaving out the table gene and comparing
gene_occurrences.gene_id directly to a given gene_id -- here via the
function get_gene_id. (This is the query you suggested, Manfred.)

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM
(SELECT DISTINCT disease.disease_name,
disease_occurrences.sentence_id
FROM disease, gene_occurrences, disease_occurrences
WHERE gene_occurrences.sentence_id=disease_occurrences.sentence_id
AND gene_occurrences.gene_id=get_gene_id('igm')
AND disease.disease_id=disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name ORDER BY cnt DESC;

'igm' occurs about 54,000 times in gene_occurrences.

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=308065.26..308069.67 rows=882 width=57) (actual time=53107.46..53108.13 rows=2326 loops=1)
   Sort Key: count(disease_name)
   ->  Aggregate  (cost=307846.61..307978.94 rows=882 width=57) (actual time=53011.97..53100.58 rows=2326 loops=1)
         ->  Group  (cost=307846.61..307934.83 rows=8822 width=57) (actual time=53011.94..53079.74 rows=16711 loops=1)
               ->  Sort  (cost=307846.61..307890.72 rows=8822 width=57) (actual time=53011.93..53020.32 rows=16711 loops=1)
                     Sort Key: disease_name
                     ->  Subquery Scan tmp  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.87..52958.72 rows=16711 loops=1)
                           ->  Unique  (cost=305367.08..306690.35 rows=8822 width=57) (actual time=52877.85..52915.20 rows=16711 loops=1)
                                 ->  Sort  (cost=305367.08..305808.17 rows=88218 width=57) (actual time=52877.85..52886.47 rows=16711 loops=1)
                                       Sort Key: disease.disease_name, disease_occurrences.sentence_id
                                       ->  Hash Join  (cost=4610.17..290873.90 rows=88218 width=57) (actual time=388.53..52752.92 rows=16711 loops=1)
                                             Hash Cond: ("outer".disease_id = "inner".disease_id)
                                             ->  Nested Loop  (cost=0.00..282735.01 rows=88218 width=20) (actual time=0.25..52184.26 rows=16711 loops=1)
                                                   ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..57778.26 rows=54692 width=8) (actual time=0.07..17455.52 rows=54677 loops=1)
                                                         Index Cond: (gene_id = 70764)
                                                   ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=0.63..0.63 rows=0 loops=54677)
                                                         Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                                             ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=387.45..387.45 rows=0 loops=1)
                                                   ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..207.71 rows=129923 loops=1)
 Total runtime: 53118.81 msec
(20 rows)

What takes up most of the runtime? The Nested Loop, but I am not sure
whether it stands for the join of disease and disease_occurrences or
rather for joining the two occurrences tables (I cannot tell which rows
belong together in the explain output).

The cost for 'igg' is higher:
estimated cost from explain: 584729.76
actual runtime: 693210.99 msec
The query plan is the same. The Nested Loop takes up most of the runtime:
(->  Nested Loop  (cost=0.00..538119.44 rows=176211 width=20) (actual time=0.28..691474.74 rows=30513 loops=1))

2. The new query, same changes (gene left out, subselect replaced with
get_gene_id):

EXPLAIN ANALYZE
SELECT disease.disease_name, count(disease.disease_name) AS cnt
FROM
((SELECT gene_occurrences.sentence_id
FROM gene_occurrences
WHERE gene_occurrences.gene_id=get_gene_id('csf')) AS tmp
JOIN disease_occurrences USING (sentence_id)) as tmp2
NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

'csf' occurs about 55,000 times in gene_occurrences.

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=323834.95..323881.54 rows=9318 width=57) (actual time=146975.89..146976.60 rows=2383 loops=1)
   Sort Key: count(disease.disease_name)
   ->  Aggregate  (cost=321208.63..322606.31 rows=9318 width=57) (actual time=146840.89..146968.58 rows=2383 loops=1)
         ->  Group  (cost=321208.63..322140.42 rows=93179 width=57) (actual time=146840.87..146941.60 rows=24059 loops=1)
               ->  Sort  (cost=321208.63..321674.53 rows=93179 width=57) (actual time=146840.85..146852.92 rows=24059 loops=1)
                     Sort Key: disease.disease_name
                     ->  Hash Join  (cost=4485.78..305826.96 rows=93179 width=57) (actual time=544.79..146651.05 rows=24059 loops=1)
                           Hash Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Nested Loop  (cost=0.00..297614.04 rows=93179 width=20) (actual time=105.85..145936.47 rows=24059 loops=1)
                                 ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8) (actual time=41.86..47116.68 rows=55752 loops=1)
                                       Index Cond: (gene_id = 29877)
                                 ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..4.09 rows=2 width=12) (actual time=1.76..1.77 rows=0 loops=55752)
                                       Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                           ->  Hash  (cost=2500.23..2500.23 rows=129923 width=37) (actual time=438.16..438.16 rows=0 loops=1)
                                 ->  Seq Scan on disease  (cost=0.00..2500.23 rows=129923 width=37) (actual time=0.02..236.78 rows=129923 loops=1)
 Total runtime: 146986.12 msec
(16 rows)

This query is obviously not as good as the old one, though I don't
understand why the explicit joins are worse than what the optimizer
chooses on its own. The Nested Loop still takes up the biggest part.

When I set enable_nestloop to false, explain outputs this plan:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2146887.90..2146934.49 rows=9318 width=57)
   Sort Key: count(disease.disease_name)
   ->  Aggregate  (cost=2144261.59..2145659.27 rows=9318 width=57)
         ->  Group  (cost=2144261.59..2145193.37 rows=93179 width=57)
               ->  Sort  (cost=2144261.59..2144727.48 rows=93179 width=57)
                     Sort Key: disease.disease_name
                     ->  Merge Join  (cost=2122513.18..2128879.92 rows=93179 width=57)
                           Merge Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..3388.03 rows=129923 width=37)
                           ->  Sort  (cost=2122513.18..2122979.08 rows=93179 width=20)
                                 Sort Key: disease_occurrences.disease_id
                                 ->  Merge Join  (cost=69145.63..2107131.52 rows=93179 width=20)
                                       Merge Cond: ("outer".sentence_id = "inner".sentence_id)
                                       ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..1960817.45 rows=15079045 width=12)
                                       ->  Sort  (cost=69145.63..69434.47 rows=57768 width=8)
                                             Sort Key: gene_occurrences.sentence_id
                                             ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..60007.96 rows=57768 width=8)
                                                   Index Cond: (gene_id = 29877)
(18 rows)

Most of the runtime is used up by the index scan that joins the two
occurrences tables, while the index scan for joining disease and
disease_occurrences is fast.
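(For completeness, the plan above was produced with the planner override, restored afterwards:)

```sql
SET enable_nestloop TO off;   -- steer the planner away from nested loops
EXPLAIN SELECT ...;           -- same query as in 2. above
SET enable_nestloop TO on;    -- restore the default
```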

At the moment my settings concerning the query planner are:

effective_cache_size = 80000 # typically 8KB each, default 1000
random_page_cost = 1.5 # units are one sequential page fetch cost
cpu_tuple_cost = 0.01 # (same), default 0.01
cpu_index_tuple_cost = 0.00001 # (same), default 0.001
cpu_operator_cost = 0.005 # (same), default 0.0025

I am still having problems reading the output of explain, especially
knowing which rows belong together and which join strategy applies to
which join. Is there documentation that describes the format of explain
output, other than the one in the main manual that comes with the
postgres installation? Just a short explanation or example of how to
interpret the indents and arrows would help.
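My working assumption so far (please correct me if this is wrong) is that each `->` node is a child of the nearest less-indented node above it, so in a fragment like:

```
->  Nested Loop
      ->  Index Scan using gene_occ_id_i on gene_occurrences
      ->  Index Scan using disease_occ_uni on disease_occurrences
```

the two index scans would be the inputs of the Nested Loop, with the second scan executed once per row of the first (which would explain the loops=54677 in the first plan).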

Thank you for your help!
Chantal
