Re: Endless loop in ExecNestLoop

From: Philipp Reisner <philipp(dot)reisner(at)linbit(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Endless loop in ExecNestLoop
Date: 2006-01-31 09:54:23
Message-ID: 200601311054.23671.philipp.reisner@linbit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Am Montag, 30. Januar 2006 17:42 schrieb Tom Lane:
> Philipp Reisner <philipp(dot)reisner(at)linbit(dot)com> writes:
> > Version: 8.0.2
>
> I don't think so ... neither bitmap scans nor slot_getattr exist in 8.0.
>

Sorry, a typo, 8.1.2 of course.

> > Is this bug-report of any use so far ?
>
> Not a lot. You need to find out what the difference is between the
> cases where the query runs quickly and those where it doesn't. I'm
> betting that the planner is choosing a very bad plan in the slow cases,
> but there's not a lot of evidence here to show what or why.
>
> The explain analyze output shows two levels of hash join underneath
> four levels of nestloop join, whereas the stack trace looks like there
> are five levels of nestloop and only one hash. So this is some evidence
> that a different plan is being used in the slow cases. The stack trace
> doesn't provide nearly enough info about what that plan is, though.
>

What reasons are there for the planer to choose a different plan ?

> Do you have geqo_threshold set to less than its default value? Do you
> sometimes execute the query with different sets of parameters? Either
> of these might lead to changes of plan.

Since we had very bad experience with the generic query optimizer we
completely disabled it. The non geqo plans perform multitudes better
on our data-set and work-load. -- And the time to plan the query is
always negligible compared to the execution time.

We have the endless loop about 10 times a day (but execute about 1.5 million queries a day) :

Program received signal SIGINT, Interrupt.
0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6
(gdb) bt
#0 0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6
#1 0x00002aaaaba163c8 in wcstombs () from /lib/libc.so.6
#2 0x000000000058dcc4 in text_oid ()
#3 0x000000000058deb9 in upper ()
#4 0x00000000004e216e in ExecMakeFunctionResult ()
#5 0x00000000004e210d in ExecMakeFunctionResult ()
#6 0x00000000004e5063 in ExecQual ()
#7 0x00000000004e554e in ExecScan ()
#8 0x00000000004e0cc8 in ExecProcNode ()
#9 0x00000000004ebeb6 in ExecEndHashJoin ()
#10 0x00000000004eb7fa in ExecHashJoin ()
#11 0x00000000004e0d50 in ExecProcNode ()
#12 0x00000000004ee29d in ExecNestLoop ()
#13 0x00000000004e0d36 in ExecProcNode ()
#14 0x00000000004ee261 in ExecNestLoop () <<-----------
#15 0x00000000004e0d36 in ExecProcNode ()
#16 0x00000000004ee261 in ExecNestLoop ()
#17 0x00000000004e0d36 in ExecProcNode ()
#18 0x00000000004ee261 in ExecNestLoop ()
#19 0x00000000004e0d36 in ExecProcNode ()
#20 0x00000000004ee261 in ExecNestLoop ()
#21 0x00000000004e0d36 in ExecProcNode ()
#22 0x00000000004ef57c in ExecSort ()
#23 0x00000000004e0d6a in ExecProcNode ()
#24 0x00000000004df73f in ExecEndPlan ()
#25 0x00000000004dec09 in ExecutorRun ()
#26 0x000000000055f87a in PortalRun ()
#27 0x000000000055f5f9 in PortalRun ()
#28 0x000000000055b561 in pg_plan_queries ()
#29 0x000000000055df8c in PostgresMain ()
#30 0x0000000000539b01 in ClosePostmasterPorts ()
#31 0x00000000005395d0 in ClosePostmasterPorts ()
#32 0x0000000000537b4d in PostmasterMain ()
#33 0x0000000000536e59 in PostmasterMain ()
#34 0x00000000004fda0c in main ()

After detaching with gdb, I killed it. I configured postgres to dump
the statement to the logfile with "log_min_error_statement = error".
So I have the exact query that caused the problem.

This time the plainer says, only explain:

Sort (cost=16257.75..16257.76 rows=1 width=262)
Sort Key: con.shortname
-> Nested Loop (cost=1.18..16257.74 rows=1 width=262)
-> Nested Loop (cost=1.18..16254.57 rows=1 width=130)
-> Nested Loop (cost=1.18..16251.50 rows=1 width=125)
-> Nested Loop (cost=1.18..16248.45 rows=1 width=111)
-> Nested Loop (cost=1.18..16245.31 rows=1 width=86)
Join Filter: ("outer".cel_objid = "inner".objid)
-> Seq Scan on devices dev (cost=0.00..11984.25 rows=1 width=63)
Filter: ((upper((COALESCE(isactiv, 'Y'::character varying))::text) = 'Y'::text) AND (upper((COALESCE(iscommittedsp, 'Y'::character varying))::text) = 'Y'::text) AND (upper((COALESCE(iscommittedcust, 'Y'::character varying))::text) = 'Y'::text))
-> Hash Left Join (cost=1.18..4256.22 rows=387 width=27)
Hash Cond: ("outer".ser_objid = "inner".objid)
-> Seq Scan on contractelements cel (cost=0.00..4253.10 rows=387 width=21)
Filter: (upper((isactiv)::text) = 'Y'::text)
-> Hash (cost=1.14..1.14 rows=14 width=14)
-> Seq Scan on servicetypes ser (cost=0.00..1.14 rows=14 width=14)
-> Index Scan using contracts_pkey on contracts con (cost=0.00..3.13 rows=1 width=33)
Index Cond: ("outer".con_objid = con.objid)
Filter: ((ccu_objid = 3837149) OR (ccu_objid = 3837320) OR (ccu_objid = 3837375) OR (ccu_objid = 3974872) OR (ccu_objid = 5032031) OR (ccu_objid = 7220313) OR (ccu_objid = 15684899) OR (ccu_objid = 16052168) OR (ccu_objid = 19196573) OR (ccu_objid = 20095712) OR (ccu_objid = 22399813) OR (ccu_objid = 42092332) OR (ccu_objid = 42092454) OR (ccu_objid = 42092574) OR (ccu_objid = 42886947) OR (ccu_objid = 43813234))
-> Index Scan using devicetypes_pkey on devicetypes dty (cost=0.00..3.04 rows=1 width=18)
Index Cond: (dty.objid = "outer".dty_objid)
-> Index Scan using bpartners_pkey on bpartners man (cost=0.00..3.06 rows=1 width=13)
Index Cond: (man.objid = "outer".bpa_objid)
-> Index Scan using locations_pkey on locations loc (cost=0.00..3.12 rows=1 width=136)
Index Cond: (loc.objid = "outer".loc_objid)
Filter: (upper((shortname)::text) = '5195'::text)
(26 rows)

Explain and execute does not terminate... even after hours.

This time I _CAN_ _REPRODUCE_ the issue in our staging system !! Would it help
to run the database server with debug symbols ? That would give more information
in the stack traceback ...

Tom, could you give us an advise for what we should look for ? How we should
continue in tracing the bug down ?

In case you care, here is the query as well:

select cel.objid,
cel.shortname, ser.shortname,
con.isInventoryFlag,
dev.objid, dty.objid, loc.objid,
dev.shortname, loc.shortname, loc.name, man.shortName,
dty.type, dev.serNr, dev.invNr, dev.component,
loc.zip, loc.city, con.shortname, con.name,
CASE
WHEN CURRENT_DATE between loc.locationstateFrom and loc.locationstateTo THEN loc.locationstateText
WHEN CURRENT_DATE between loc.locationstate2From and loc.locationstate2To THEN loc.locationstate2Text
ELSE null
END
, CASE
WHEN CURRENT_DATE between loc.locationstateFrom and loc.locationstateTo THEN loc.locationstateColorCode
WHEN CURRENT_DATE between loc.locationstate2From and loc.locationstate2To THEN loc.locationstate2ColorCode
ELSE null
END
,
loc.locationcategory, loc.province
from contracts con, contractelements cel
left outer join servicetypes ser on ser.objid = cel.ser_objid,
devices dev, locations loc, devicetypes dty, bpartners man
where con.ccu_objid in
(3837149,3837320,3837375,3974872,5032031,7220313,15684899,16052168,19196573,20095712,22399813,42092332,42092454,42092574,42886947,43813234) and cel.con_objid = con.objid
and dev.cel_objid = cel.objid
and loc.objid = dev.loc_objid
and dty.objid = dev.dty_objid
and man.objid = dty.bpa_objid
and upper(cel.isActiv) = 'Y'
and upper(coalesce(dev.isActiv,'Y')) = 'Y'
and upper(coalesce(dev.IsCommittedSP,'Y')) = 'Y'
and upper(coalesce(dev.IsCommittedCust,'Y')) = 'Y'
and upper(loc.shortName) = '5195'
order by 18 ASC

-philipp
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com :

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Eisentraut 2006-01-31 11:31:24 ECPG connect by URL doesn't work
Previous Message Neil Conway 2006-01-31 02:50:17 Re: Bug#347548: DOMAIN CHECK constraint bypassed