how to interpret/improve bad row estimates

From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: how to interpret/improve bad row estimates
Date: 2006-02-23 16:29:32
Message-ID: 1140712182.2190.297.camel@camel
Lists: pgsql-performance

PostgreSQL 8.1. I have two tables, both vacuumed and analyzed. On msg307 I
have raised the statistics targets for the entityid and msgid columns to 400.
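
For reference, the statements were along these lines (roughly; the exact
form I used was equivalent to this):

  ALTER TABLE msg307 ALTER COLUMN entityid SET STATISTICS 400;
  ALTER TABLE msg307 ALTER COLUMN msgid    SET STATISTICS 400;
  ANALYZE msg307;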

dev20001=# explain analyze SELECT ewm.entity_id, m.agentname, m.filecreatedate AS versioninfo
FROM msg307 m join entity_watch_map ewm on (ewm.entity_id = m.entityid AND ewm.msgid = m.msgid AND ewm.msg_type = 307);

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=6.62..5227.40 rows=1 width=36) (actual time=0.583..962.346 rows=75322 loops=1)
   ->  Bitmap Heap Scan on entity_watch_map ewm  (cost=6.62..730.47 rows=748 width=8) (actual time=0.552..7.017 rows=1264 loops=1)
         Recheck Cond: (msg_type = 307)
         ->  Bitmap Index Scan on ewm_msg_type  (cost=0.00..6.62 rows=748 width=0) (actual time=0.356..0.356 rows=1264 loops=1)
               Index Cond: (msg_type = 307)
   ->  Index Scan using msg307_entityid_msgid_idx on msg307 m  (cost=0.00..6.00 rows=1 width=40) (actual time=0.011..0.295 rows=60 loops=1264)
         Index Cond: (("outer".entity_id = m.entityid) AND ("outer".msgid = m.msgid))
 Total runtime: 1223.469 ms
(8 rows)

I guess the planner has no way to know whether the values in those two
columns are correlated, so it treats them as independent, makes a really
bad estimate on the index scan, and pushes that estimate up into the
nested loop? (Luckily, in this case an index scan is generally a good idea
anyway, so it works out, but it wouldn't always be.)
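
My understanding, and this is an assumption about planner internals rather
than something I've verified in the source, is that it estimates the
selectivity of each equality clause separately and multiplies them together,
so the combined (entityid, msgid) match rate gets drastically underestimated
when the two columns move together. A quick sanity check one could run to
see how far reality is from that independence assumption:

  -- how many (entityid, msgid) pairs actually occur, vs. the
  -- nd(entityid) * nd(msgid) combinations the planner implicitly assumes
  SELECT count(DISTINCT entityid) AS nd_entityid,
         count(DISTINCT msgid)    AS nd_msgid,
         (SELECT count(*)
            FROM (SELECT DISTINCT entityid, msgid FROM msg307) s) AS nd_pairs,
         count(*)                 AS total_rows
  FROM msg307;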

Some pg_statistic information for those two columns:
entityid:
starelid | 25580
staattnum | 1
stanullfrac | 0
stawidth | 4
stadistinct | 1266
stakind1 | 1
stakind2 | 2
stakind3 | 3
stakind4 | 0
staop1 | 96
staop2 | 97
staop3 | 97
staop4 | 0
stanumbers1 | {0.00222976,0.00222976,0.00153048,0.00137216,0.00137216}
stanumbers2 |
stanumbers3 | {0.100312}
stanumbers4 |

msgid:
starelid | 25580
staattnum | 2
stanullfrac | 0
stawidth | 4
stadistinct | 1272
stakind1 | 1
stakind2 | 2
stakind3 | 3
stakind4 | 0
staop1 | 96
staop2 | 97
staop3 | 97
staop4 | 0
stanumbers1 | {0.00164923,0.00163604,0.00163604,0.00163604,0.00137216}
stanumbers2 |
stanumbers3 | {-0.0660856}
stanumbers4 |
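
The same numbers can be pulled in human-readable form from the pg_stats
view; a sketch, assuming the table and column names used above:

  SELECT attname, null_frac, avg_width, n_distinct,
         most_common_vals, most_common_freqs, correlation
  FROM pg_stats
  WHERE tablename = 'msg307'
    AND attname IN ('entityid', 'msgid');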

Is my interpretation of why I am seeing such bad estimates correct? I don't
really think it is, because looking at a similar scenario on a 7.3 machine:

------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1531.39..5350.90 rows=1 width=48) (actual time=118.44..899.37 rows=58260 loops=1)
   Merge Cond: (("outer".entityid = "inner".entity_id) AND ("outer".msgid = "inner".msgid))
   ->  Index Scan using msg307_entityid_msgid_idx on msg307 m  (cost=0.00..3669.42 rows=58619 width=40) (actual time=0.31..390.01 rows=58619 loops=1)
   ->  Sort  (cost=1531.39..1533.16 rows=709 width=8) (actual time=118.09..157.45 rows=58218 loops=1)
         Sort Key: ewm.entity_id, ewm.msgid
         ->  Seq Scan on entity_watch_map ewm  (cost=0.00..1497.80 rows=709 width=8) (actual time=0.14..114.74 rows=1157 loops=1)
               Filter: (msg_type = 307)
 Total runtime: 951.23 msec
(8 rows)

It still has the bad estimate at the join stage (the merge join is estimated
at rows=1), but it does seem to have a better understanding of the number of
rows the index scan on msg307 will return. This leads me to wonder if there
is something I could do to improve the estimates on the 8.1 machine?
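
One experiment I can run to see what plan the 8.1 planner falls back to when
the nested loop isn't available (just a sketch, using the same query):

  SET enable_nestloop = off;
  EXPLAIN ANALYZE
  SELECT ewm.entity_id, m.agentname, m.filecreatedate AS versioninfo
  FROM msg307 m
  JOIN entity_watch_map ewm
    ON ewm.entity_id = m.entityid
   AND ewm.msgid = m.msgid
   AND ewm.msg_type = 307;
  RESET enable_nestloop;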

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
