Re: cost and actual time

From: Chantal Ackermann <chantal(dot)ackermann(at)biomax(dot)de>
To: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: cost and actual time
Date: 2003-02-17 10:51:56
Message-ID: 3E50BECC.9090107@biomax.de

hello Josh,

thank you for your quick answer. (I had some days off.)

This posting is quite long, I apologize, but I wanted to provide enough
information to outline the problem.

I ran VACUUM ANALYZE on all four tables concerned (gene, disease,
gene_occurrences, disease_occurrences) to be sure the planner statistics
are up to date, but that did not reduce the difference between the
estimated and the actual number of result rows.

I changed default_statistics_target to 1000 (default: 10). The estimate
goes up to 102 rows, which is slightly more than before and still far
away from the actual result.

The effective_cache_size is at 80000.

To be sure I hadn't made things worse, I checked again with
default_statistics_target set back to 10 and an effective_cache_size of
1000 (and ran VACUUM ANALYZE afterwards). The estimate is then 92 rows,
so there is no really big difference.
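
For reference, this is roughly how I switch the statistics target and
refresh the statistics (only a sketch of the commands; the target value
is what I vary):

SET default_statistics_target = 1000;  -- picked up by the ANALYZE runs in this session
VACUUM ANALYZE gene;
VACUUM ANALYZE disease;
VACUUM ANALYZE gene_occurrences;
VACUUM ANALYZE disease_occurrences;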

I wonder whether I can set some GEQO or planner settings in
postgresql.conf to make the planner estimate better. The database is
used exclusively for reading, so it is OK if the time needed for
analyzing the tables increases.
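
Something I could still try (only a sketch, I have not verified that it
helps here) is raising the statistics target just for the join and
filter columns instead of globally, and re-analyzing:

ALTER TABLE gene_occurrences ALTER COLUMN gene_id SET STATISTICS 1000;        -- per-column target,
ALTER TABLE gene_occurrences ALTER COLUMN sentence_id SET STATISTICS 1000;    -- overrides default_statistics_target
ALTER TABLE disease_occurrences ALTER COLUMN sentence_id SET STATISTICS 1000;
ALTER TABLE disease_occurrences ALTER COLUMN disease_id SET STATISTICS 1000;
ANALYZE gene_occurrences;
ANALYZE disease_occurrences;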

The query I am testing with is:

EXPLAIN ANALYZE
SELECT tmp.disease_name, count(tmp.disease_name) AS cnt
FROM (SELECT DISTINCT disease.disease_name, disease_occurrences.sentence_id
        FROM gene, disease, gene_occurrences, disease_occurrences
       WHERE gene.gene_name = 'igg'
         AND gene_occurrences.sentence_id = disease_occurrences.sentence_id
         AND gene.gene_id = gene_occurrences.gene_id
         AND disease.disease_id = disease_occurrences.disease_id) AS tmp
GROUP BY tmp.disease_name
ORDER BY cnt DESC;

Row counts are:
'gene': 164657
'disease': 129923
'gene_occurrences': 10484141
'disease_occurrences': 15079045

The gene_id for 'igg' occurs 110637 times in gene_occurrences; it is the
most frequent value.
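
To see whether the planner even knows how frequent this value is, I can
look at the collected statistics directly (a sketch using the standard
pg_stats view):

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'gene_occurrences' AND attname = 'gene_id';

If the gene_id of 'igg' does not show up in most_common_vals, that would
explain why the row estimate for this index scan stays so far below 110637.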

The indexes on gene_occurrences(sentence_id) and
disease_occurrences(sentence_id) are clustered.
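
The clustering was done with statements like the following (the index
names here are only placeholders for the real sentence_id indexes):

CLUSTER gene_occ_sentence_i ON gene_occurrences;       -- CLUSTER indexname ON tablename
CLUSTER disease_occ_sentence_i ON disease_occurrences;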

I have an alternative query which I am testing to see whether it is
better than the first one:

EXPLAIN ANALYZE
SELECT disease.disease_name, count(disease.disease_name) AS cnt
FROM ((SELECT gene_occurrences.sentence_id
         FROM gene_occurrences
        WHERE gene_occurrences.gene_id = (SELECT gene.gene_id FROM gene
                                           WHERE gene.gene_name = 'igg')) AS tmp
      JOIN disease_occurrences USING (sentence_id)) AS tmp2
NATURAL JOIN disease
GROUP BY disease.disease_name
ORDER BY cnt DESC;

The cost estimated by the planner is much higher, so I thought this
query was worse than the first one. However, maybe the estimate is just
more accurate?

This is its EXPLAIN output (with the default setting of
default_statistics_target at VACUUM ANALYZE time):

------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=126690.02..126691.47 rows=581 width=57) (actual time=8066.05..8067.05 rows=3364 loops=1)
   Sort Key: count(disease.disease_name)
   InitPlan
     ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1 width=4) (actual time=0.19..0.20 rows=1 loops=1)
           Index Cond: (gene_name = 'igg'::text)
   ->  Aggregate  (cost=126619.79..126663.35 rows=581 width=57) (actual time=7894.33..8055.43 rows=3364 loops=1)
         ->  Group  (cost=126619.79..126648.83 rows=5809 width=57) (actual time=7894.31..8020.00 rows=30513 loops=1)
               ->  Sort  (cost=126619.79..126634.31 rows=5809 width=57) (actual time=7894.30..7910.08 rows=30513 loops=1)
                     Sort Key: disease.disease_name
                     ->  Merge Join  (cost=119314.93..126256.64 rows=5809 width=57) (actual time=6723.92..7732.94 rows=30513 loops=1)
                           Merge Cond: ("outer".disease_id = "inner".disease_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..6519.14 rows=129923 width=37) (actual time=0.04..742.20 rows=129872 loops=1)
                           ->  Sort  (cost=119314.93..119329.45 rows=5809 width=20) (actual time=6723.74..6740.24 rows=30513 loops=1)
                                 Sort Key: disease_occurrences.disease_id
                                 ->  Nested Loop  (cost=0.00..118951.78 rows=5809 width=20) (actual time=1.19..6558.67 rows=30513 loops=1)
                                       ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..15700.31 rows=4039 width=8) (actual time=0.36..1404.64 rows=110637 loops=1)
                                             Index Cond: (gene_id = $0)
                                       ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..25.47 rows=8 width=12) (actual time=0.04..0.04 rows=0 loops=110637)
                                             Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
 Total runtime: 8086.87 msec
(20 rows)

Strangely, the estimation is far worse after running VACUUM ANALYZE
again with default_statistics_target set to 1000:

------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=12521.37..12521.61 rows=96 width=57) (actual time=124967.47..124968.47 rows=3364 loops=1)
   Sort Key: count(disease.disease_name)
   InitPlan
     ->  Index Scan using gene_uni on gene  (cost=0.00..5.26 rows=1 width=4) (actual time=20.27..20.28 rows=1 loops=1)
           Index Cond: (gene_name = 'igg'::text)
   ->  Aggregate  (cost=12510.99..12518.20 rows=96 width=57) (actual time=124788.71..124956.77 rows=3364 loops=1)
         ->  Group  (cost=12510.99..12515.80 rows=961 width=57) (actual time=124788.68..124920.10 rows=30513 loops=1)
               ->  Sort  (cost=12510.99..12513.39 rows=961 width=57) (actual time=124788.66..124804.74 rows=30513 loops=1)
                     Sort Key: disease.disease_name
                     ->  Nested Loop  (cost=0.00..12463.35 rows=961 width=57) (actual time=164.11..124529.76 rows=30513 loops=1)
                           ->  Nested Loop  (cost=0.00..6671.06 rows=961 width=20) (actual time=148.34..120295.52 rows=30513 loops=1)
                                 ->  Index Scan using gene_occ_id_i on gene_occurrences  (cost=0.00..2407.09 rows=602 width=8) (actual time=20.63..1613.99 rows=110637 loops=1)
                                       Index Cond: (gene_id = $0)
                                 ->  Index Scan using disease_occ_uni on disease_occurrences  (cost=0.00..7.06 rows=2 width=12) (actual time=1.07..1.07 rows=0 loops=110637)
                                       Index Cond: ("outer".sentence_id = disease_occurrences.sentence_id)
                           ->  Index Scan using disease_pkey on disease  (cost=0.00..6.01 rows=1 width=37) (actual time=0.13..0.13 rows=1 loops=30513)
                                 Index Cond: ("outer".disease_id = disease.disease_id)
 Total runtime: 124981.15 msec
(18 rows)

There again the estimate is 961 rows, and the planner chooses a nested
loop, while the actual result contains 30513 rows.
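
To check whether the nested loop itself is what hurts, I could force the
planner away from it for a single test run (just a session-level
experiment, not a permanent setting):

SET enable_nestloop TO off;
-- re-run the EXPLAIN ANALYZE from above and compare plan and runtime
SET enable_nestloop TO on;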

Thank you for taking the time to read my postings!
Chantal

Josh Berkus wrote:
> Chantal,
>
>
>>Sort Key: disease.disease_name, disease_occurrences.sentence_id
>>-> Nested Loop (cost=0.00..6922.38 rows=98 width=64) (actual
>>time=61.49..275047.46 rows=18910 loops=1)
>> -> Nested Loop (cost=0.00..6333.23 rows=98 width=28) (actual
>>time=61.42..274313.87 rows=18910 loops=1)
>> -> Nested Loop (cost=0.00..5894.04 rows=64 width=16) (actual
>>time=32.00..120617.26 rows=46849 loops=1)
>>
>>I tried to tweak the conf settings, but I think I already reached
>>quite a good value concerning shared buffers and sort mem. the
>>database is vacuum full analyzed. indexes seem fine.
>
>
> You *sure* that you've vacuum analyzed recently? The planner above is
> choosing a bad plan because its row estimates are way off ... if the
> subquery was actually returning 98 rows, the plan above would make
> sense ... but with 18,000 rows being returned, a Nested Loop is
> suicidal.
>
> Perhaps you could post the full text of the query? If some of your
> criteria are coming from volatile functions, then that could explain
> why the planner is so far off ...
>
> -Josh Berkus
>
>
