Bogus nestloop join estimate, ignores WHERE clause

From: Marti Raudsepp <marti(at)juffo(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Bogus nestloop join estimate, ignores WHERE clause
Date: 2011-08-31 12:38:50
Message-ID: CABRT9RAWwn0KQ66K=7cvTuxyTzEHd6P_PxYDkT3gfwg2xqybHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi list!

I'm getting really surprising planner estimates for a query that's
joining another table via a varchar field. All of this was tested on
PostgreSQL 8.4.8, 9.0.4 and 9.1rc1.

The original query is pretty huge, but I managed to shrink it down to this:
SELECT * FROM email_message where email_message.id_code IN (SELECT
id_code FROM client WHERE client_id='83509');

* id_code is an indexed varchar(20) NOT NULL column in both tables
* client_id is the primary key of client.
* There are 149152 rows in email_message and 140975 rows in client
* The most common value in both sides of the join is an empty string.
121970 in email_message and 10753 in client
(Turning the empty values into NULLs helps a little, but still gives
bad estimates)

This is the plan I get:
EXPLAIN SELECT * FROM email_message where email_message.id_code IN
(SELECT id_code FROM client WHERE client_id='83509');
Nested Loop (cost=8.28..36.86 rows=139542 width=101)
-> HashAggregate (cost=8.28..8.29 rows=1 width=11)
-> Index Scan using client_pkey on client (cost=0.00..8.28
rows=1 width=11)
Index Cond: (client_id = 83509)
-> Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
Index Cond: ((email_message.id_code)::text = (client.id_code)::text)
(6 rows)

This nestloop couldn't possibly generate 139542 rows since the inner
plan is expected to return 1 row and the outer plan 41

After a bit of digging, I figured out that it uses the same estimate
as a semi-join WITHOUT the client_id restriction.
EXPLAIN SELECT * FROM email_message m WHERE EXISTS(SELECT * FROM
client c WHERE m.id_code=c.id_code);
Nested Loop Semi Join (cost=0.00..7725.31 rows=139542 width=101)
-> Seq Scan on email_message m (cost=0.00..3966.52 rows=149152 width=101)
-> Index Scan using client_id_code_idx1 on client c
(cost=0.00..0.39 rows=1 width=11)
Index Cond: ((c.id_code)::text = (m.id_code)::text)

For whatever reason, the 1st query completely ignores the fact that
the client_id clause reduces the result count by a large factor.

So I turned this into a simple JOIN and I'm still seeing bad estimates:

EXPLAIN SELECT * FROM email_message JOIN client USING (id_code) WHERE
client_id='83509';
Nested Loop (cost=0.00..36.85 rows=9396 width=252)
-> Index Scan using client_pkey on client (cost=0.00..8.28 rows=1
width=162)
Index Cond: (client_id = 83509)
-> Index Scan using email_message_id_code_idx on email_message
(cost=0.00..28.05 rows=41 width=101)
Index Cond: ((email_message.id_code)::text = (client.id_code)::text)

This is better, but still overestimates massively.

When I change empty values to NULLs, then this JOIN query starts
estimating correctly. So this one is probably confused because the
empty values would result in a cartesian join.

Are there any reasons why nestloop can't use the known (1 * 41) as its estimate?

Regards,
Marti Raudsepp
voicecom.ee

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2011-08-31 13:05:31 setlocale() on Windows is broken
Previous Message Bernd Helmle 2011-08-31 12:13:11 Informix FDW - anybody working on this?