Quick Links

LEFT JOINs not optimized away when not needed

From:	Moshe Jacobson <moshe(at)neadwerx(dot)com>
To:	pgsql-bugs(at)postgresql(dot)org, Robert Mowlavi <robert(at)neadwerx(dot)com>, Chris Autry <chris(at)neadwerx(dot)com>
Subject:	LEFT JOINs not optimized away when not needed
Date:	2014-07-08 14:04:21
Message-ID:	CAJ4CxLkrWctbEhoshWvxnEDi9=PvoDbWzdf5OJqvfv840Hv0_Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs pgsql-hackers

Dear PostgreSQL Gurus,

We currently have php code which, for the purposes of speed optimization,
dynamically constructs queries based on the output columns and filter
conditions requested by the application. This code is very complicated, so
we are trying to simplify things by using views instead. We have in mind to
define the view entirely in terms of left joins, so that when particular
tables in the view are not requested on the output or used in filtering,
Postgres will not join the table, and the query will still be fast. This
would drastically simplify our middleware code.

However, it turns out that Postgres is not optimizing away the left joins
as I would expect. See the following query:

SELECT r.reset
FROM tb_reset r
LEFT JOIN tb_project pj ON pj.project = r.project
LEFT JOIN tb_location l ON l.location = r.location
LEFT JOIN tb_program pg ON pg.program = r.program
LEFT JOIN tb_reset_survey rsv ON rsv.reset = r.reset
LEFT JOIN tb_program_location pl ON pl.program = r.program AND
pl.location = r.location
LEFT JOIN tb_program_type pgt ON pgt.program_type = pg.program_type
LEFT JOIN tb_project_department pd ON pd.project = pj.project
LEFT JOIN tb_department d ON d.department = pd.department
LEFT JOIN tb_fiscal_calendar fc ON fc.day = r.execution_date
LEFT JOIN tb_fiscal_month fm ON fm.fiscal_month = fc.month
LEFT JOIN tb_rollout_schedule rs ON rs.fiscal_week = fc.week AND
rs.fiscal_year = fc.year AND rs.program = r.program
LEFT JOIN tb_reset_team rt ON rt.reset_team = r.reset_team
LEFT JOIN tb_vendor v ON v.vendor = rt.vendor
LEFT JOIN tb_reset_status rst ON rst.reset_status = r.reset_status
LEFT JOIN tb_order_location ol ON ol.location = r.location
LEFT JOIN tb_entity_reset er ON er.reset = r.reset
LEFT JOIN tb_market m ON m.market = l.market
LEFT JOIN tb_district dist ON dist.district = l.district
LEFT JOIN tb_mregion mr ON mr.mregion = l.mregion
LEFT JOIN tb_region rg ON rg.region = l.region
LEFT JOIN tb_ogrp ogrp ON ogrp.ogrp = l.ogrp
LEFT JOIN tb_buying_office byo ON byo.buying_office = l.buying_office
LEFT JOIN tb_project_participant pp ON pp.project = r.project
LEFT JOIN tb_project_status ps ON ps.project_status = pj.project_status
WHERE r.in_scope IS TRUE
GROUP BY r.reset
ORDER BY r.reset
LIMIT 100 OFFSET 0;

Seeing that there is only one output column, and that the results are
grouped by this output column, it seems to me that the optimizer should not
even look at the rest of the tables. However, in reality the following
query plan is produced:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1.43..162.06 rows=100 width=4) (actual time=0.110..3.977
rows=100 loops=1)
-> Group (cost=1.43..3250.84 rows=2023 width=4) (actual
time=0.109..3.941 rows=100 loops=1)
-> Nested Loop Left Join (cost=1.43..3239.44 rows=4560
width=4) (actual time=0.106..3.652 rows=1209 loops=1)
Join Filter: (pp.project = r.project)
Rows Removed by Join Filter: 1626
-> Nested Loop Left Join (cost=1.43..2964.79
rows=4560 width=8) (actual time=0.089..2.172 rows=629 loops=1)
-> Merge Left Join (cost=1.15..2232.83
rows=2023 width=12) (actual time=0.075..1.094 rows=100 loops=1)
Merge Cond: (r.reset = er.reset)
-> Nested Loop Left Join
(cost=1.00..2165.74 rows=2023 width=12) (actual time=0.072..1.051
rows=100 loops=1)
Join Filter: ((rs.fiscal_week =
fc.week) AND (rs.fiscal_year = fc.year))
-> Nested Loop Left Join
(cost=0.85..1318.06 rows=2023 width=24) (actual time=0.068..0.874
rows=100 loops=1)
-> Nested Loop Left Join
(cost=0.56..523.87 rows=2023 width=20) (actual time=0.065..0.744
rows=100 loops=1)
Join Filter: (pd.project
= pj.project)
Rows Removed by Join Filter: 200
-> Merge Left Join
(cost=0.56..431.80 rows=2023 width=24) (actual time=0.043..0.526
rows=100 loops=1)
Merge Cond:
(r.reset = rsv.reset)
-> Nested Loop
Left Join (cost=0.42..409.82 rows=2023 width=24) (actual
time=0.037..0.474 rows=100 loops=1)
-> Index
Scan using tb_reset_pkey on tb_reset r (cost=0.28..72.42 rows=2023
width=28) (actual time=0.023..0.111 rows=100 loops=1)
Filter:
(in_scope IS TRUE)
-> Index
Scan using tb_project_pkey on tb_project pj (cost=0.14..0.16 rows=1
width=8) (actual time=0.002..0.003 rows=1 loops=100)
Index
Cond: (project = r.project)
-> Index Only Scan
using ix_reset_survey on tb_reset_survey rsv (cost=0.14..15.29
rows=130 width=4) (actual time=0.004..0.004 rows=0 loops=1)
Heap Fetches: 0
-> Materialize
(cost=0.00..1.04 rows=3 width=8) (actual time=0.000..0.001 rows=3
loops=100)
-> Seq Scan on
tb_project_department pd (cost=0.00..1.03 rows=3 width=8) (actual
time=0.006..0.007 rows=3 loops=1)
-> Index Scan using
tb_fiscal_calendar_day_key on tb_fiscal_calendar fc (cost=0.29..0.38
rows=1 width=16) (actual time=0.000..0.000 rows=0 loops=100)
Index Cond: (day =
r.execution_date)
-> Index Scan using
tb_rollout_schedule_program_key on tb_rollout_schedule rs
(cost=0.15..0.30 rows=8 width=12) (actual time=0.001..0.001 rows=0
loops=100)
Index Cond: (program = r.program)
-> Index Only Scan using
tb_entity_reset_reset_role_key on tb_entity_reset er
(cost=0.15..39.90 rows=1770 width=4) (actual time=0.001..0.001 rows=0
loops=1)
Heap Fetches: 0
-> Index Only Scan using
tb_order_location_location_key on tb_order_location ol
(cost=0.28..0.34 rows=2 width=4) (actual time=0.004..0.008 rows=6
loops=100)
Index Cond: (location = r.location)
Heap Fetches: 629
-> Materialize (cost=0.00..1.06 rows=4 width=4)
(actual time=0.000..0.001 rows=4 loops=629)
-> Seq Scan on tb_project_participant pp
(cost=0.00..1.04 rows=4 width=4) (actual time=0.002..0.005 rows=4
loops=1)
Total runtime: 4.421 ms
(37 rows)

When I last reported this issue, I was asked to produce a self-contained
example that could be used to reproduce this behavior. This was not easy
because of the presence of sensitive customer data, a complex schema, and
the need to have data in the tables for this problem to manifest. However,
we have created the requested example, and the sql dump of the schema is
attached.

Please let me know why Postgres is behaving this way. We are very eager to
implement this new solution, but we are stalled out right now.

Thank you.

Moshe Jacobson
Manager of Systems Engineering, Nead Werx Inc. <http://www.neadwerx.com>
2323 Cumberland Parkway · Suite 201 · Atlanta, GA 30339

“Quality is not an act, it is a habit.” — Aristotle

Attachment	Content-Type	Size
schema.zip	application/zip	221.9 KB

Responses

Re: LEFT JOINs not optimized away when not needed at 2014-07-08 15:19:31 from Tom Lane

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Tom Lane	2014-07-08 15:19:31	Re: LEFT JOINs not optimized away when not needed
Previous Message	Tom Lane	2014-07-07 18:35:11	Re: BUG #10888: application is getting hanged in the poll() function of libpq.so.

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Kohei KaiGai	2014-07-08 14:14:29	Re: RLS Design
Previous Message	Michael Paquier	2014-07-08 13:58:50	Re: Doing better at HINTing an appropriate column within errorMissingColumn()