A very quick observation of dangling pointers in Postgres pathlists

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Subject: A very quick observation of dangling pointers in Postgres pathlists
Date: 2026-04-17 08:56:34
Message-ID: adab9758-f346-4263-93af-3e37b7b315b7@gmail.com
Lists: pgsql-hackers

Hi,

A community consensus seems to be forming that Postgres should separate
optimisation features into 'conventional' and 'magic' classes [1]. This raises
my concern that hidden contracts about the state and ordering of pathlists
could lead to subtle bugs if an extension's optimisation goes too far.

I think this topic is of interest because of the growing number of features that
affect path choice, such as ‘disable node’ or pg_plan_advice. Also, emerging
techniques that involve two or more levels of plan trees, like ‘eager
aggregation’, might trip over a dangling pointer that has been hiding in a
pathlist for a while. Don’t forget the complicated cases with FDW and Custom
nodes, too.

For this purpose, a tiny debugging extension module, pg_pathcheck [2], has been
written. It uses create_upper_paths_hook and planner_shutdown_hook. The
extension walks the entire Path tree, starting from the top PlannerInfo, then
recurses into glob->subroots, traversing each RelOptInfo and each pathlist.
It also follows the path->subpath links, so that potentially quite complex path
trees hanging off a single RelOptInfo are covered as well. For each pointer it
visits, it checks whether the NodeTag matches a known Path type. If not, the
memory was freed (and, with CLOBBER_FREED_MEMORY, set to 0x7F) or reused for
something else.
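
To make the check concrete, here is a minimal standalone sketch in C. It is not
the pg_pathcheck code and does not use PostgreSQL headers: DemoTag, DemoPath and
the demo_* functions are hypothetical stand-ins for NodeTag, Path and the
walker, and demo_clobber() only imitates what pfree() does under
CLOBBER_FREED_MEMORY.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for NodeTag; these are not the real enum values. */
typedef enum DemoTag
{
    DEMO_T_Invalid = 0,
    DEMO_T_Path,
    DEMO_T_ProjectionPath,
    DEMO_T_SortPath
} DemoTag;

/* Hypothetical stand-in for a Path node with a single child. */
typedef struct DemoPath
{
    DemoTag     type;
    struct DemoPath *subpath;   /* child path, or NULL for a leaf */
} DemoPath;

/* Does the tag belong to the set of known Path types? */
static int
demo_tag_is_path(DemoTag tag)
{
    switch (tag)
    {
        case DEMO_T_Path:
        case DEMO_T_ProjectionPath:
        case DEMO_T_SortPath:
            return 1;
        default:
            return 0;
    }
}

/* Count invalid pointers reachable from one path via subpath links. */
static int
demo_check_path(const DemoPath *path)
{
    if (path == NULL)
        return 0;
    if (!demo_tag_is_path(path->type))
        return 1;               /* report it; do not recurse into garbage */
    return demo_check_path(path->subpath);
}

/* Count invalid pointers reachable from every entry of a pathlist. */
static int
demo_check_pathlist(DemoPath *const *pathlist, size_t npaths)
{
    int         bad = 0;

    assert(pathlist != NULL || npaths == 0);
    for (size_t i = 0; i < npaths; i++)
        bad += demo_check_path(pathlist[i]);
    return bad;
}

/* Imitate freeing with clobbering: overwrite the node with 0x7F bytes. */
static void
demo_clobber(DemoPath *path)
{
    memset(path, 0x7F, sizeof(*path));
}
```

If two pathlist entries share one subpath and that subpath is clobbered,
demo_check_pathlist() reports two invalid references, one per parent that
still points at it.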

This approach is not free of caveats. For example, most Path nodes and many Plan
nodes fit within the 128-byte size of the minimal allocated chunk. That means
that after one path is freed, the optimiser can immediately allocate another
Path node in the same chunk, potentially at a different query tree level. The
stale pointer then refers to a node with a valid tag, and the check does not
fire. I had such a case at least once in production. It was hard to recognise,
reproduce, and fix.
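
The caveat can be reproduced in a standalone C sketch. Everything here is an
assumption made for illustration: the tiny fixed pool with a free stack merely
imitates how a freed chunk of the same size class can be handed out again, and
ReusePath, ReuseTag and the reuse_* functions are hypothetical names, not
PostgreSQL APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REUSE_CHUNK_SIZE 128    /* assumed size of the smallest chunk */
#define REUSE_POOL_CHUNKS 4

/* Hypothetical stand-ins for NodeTag and Path. */
typedef enum ReuseTag
{
    REUSE_T_Invalid = 0,
    REUSE_T_SortPath,
    REUSE_T_AggPath
} ReuseTag;

typedef struct ReusePath
{
    ReuseTag    type;
    int         level;          /* query tree level that owns the node */
} ReusePath;

/* Every node lives in a fixed-size chunk, whatever its real size is. */
typedef union ReuseChunk
{
    ReusePath   path;
    unsigned char pad[REUSE_CHUNK_SIZE];
} ReuseChunk;

static ReuseChunk reuse_pool[REUSE_POOL_CHUNKS];
static size_t reuse_pool_used = 0;
static ReuseChunk *reuse_free_stack[REUSE_POOL_CHUNKS];
static size_t reuse_free_count = 0;

/* Hand out the most recently freed chunk first, else a fresh one. */
static ReusePath *
reuse_alloc(void)
{
    if (reuse_free_count > 0)
        return &reuse_free_stack[--reuse_free_count]->path;
    assert(reuse_pool_used < REUSE_POOL_CHUNKS);
    return &reuse_pool[reuse_pool_used++].path;
}

/* Clobber the chunk with 0x7F and remember it for reuse. */
static void
reuse_free(ReusePath *path)
{
    ReuseChunk *chunk = (ReuseChunk *) path;

    memset(chunk, 0x7F, sizeof(*chunk));
    assert(reuse_free_count < REUSE_POOL_CHUNKS);
    reuse_free_stack[reuse_free_count++] = chunk;
}

static int
reuse_tag_is_path(ReuseTag tag)
{
    return tag == REUSE_T_SortPath || tag == REUSE_T_AggPath;
}

/* Returns 1 if a stale pointer passes the tag check after chunk reuse. */
static int
reuse_stale_pointer_passes_check(void)
{
    ReusePath  *sort = reuse_alloc();
    ReusePath  *stale = sort;   /* a pathlist entry that is never updated */
    ReusePath  *agg;

    sort->type = REUSE_T_SortPath;
    sort->level = 1;
    reuse_free(sort);           /* the tag check would catch 'stale' here */

    agg = reuse_alloc();        /* same chunk, different query tree level */
    agg->type = REUSE_T_AggPath;
    agg->level = 2;

    return agg == stale && reuse_tag_is_path(stale->type);
}
```

Between reuse_free() and the second reuse_alloc() the stale pointer is
detectable; once the chunk is reused, the tag check alone can no longer tell
that the pointer now refers to an unrelated node.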

Running make check-world tests with the debug module loaded at startup revealed
many cases in which RelOptInfo structures contain dangling pointers. What
exactly do we see there?

The pathlist contents at the moment of an ‘Invalid’ path detection:

* ProjectionPath, Invalid — by far the most common, on JOIN RelOptInfos.
* ProjectionPath, Invalid, SortPath.
* AggPath, Invalid.
* NestPath, Invalid.
* HashPath, Invalid.
* cheapest_startup_path referencing a dangling pointer, on what looks
like a join of two partitions.
* cheapest_startup_path referencing a dangling pointer on a plain base
RelOptInfo.

The best-known problematic code causing this issue is
apply_scanjoin_target_to_paths() and the current_rel/final_rel game from commit
0927d2f46dd. After a quick fix for it, some more combinations emerged:

* UniquePath, Invalid
* MergePath, Invalid
* SubqueryScanPath, Invalid
* SetOpPath, Invalid
* GatherPath, Path, Invalid
* AppendPath, AggPath, Invalid, AggPath
* HashPath, Invalid
* AppendPath, HashPath, Invalid

These new invalid references occur outside the originally identified code path,
showing that fixing one place does not address the broader issue (maybe my fixes
were wrong?). While some claim that the cost-dominance principle ('the cheapest
path is never invalid') provides safety, I have not found any confirmation of
this. As the planner grows, undocumented rules leave the system vulnerable.

The purpose of this email is basically to highlight the issue and start a
discussion on how to solve it. Ashutosh designed a 'smart pointer' approach,
which seems the most balanced and bulletproof way. The alternatives, a 'used'
flag or local memory contexts, seem less attractive: we should always keep in
mind multi-child cases, where unnecessary paths need to be freed in place to
reduce memory consumption. But before diving into the code and identifying the
origins of these cases, I’d like to know: is it an actual problem, or is the
cost-dominance contract enough?

[1]
https://www.postgresql.org/message-id/CA+TgmoaPgXYYEivQWxyVV=eYhN+T9JAgS9Xe4m7g9wVitVPF8g@mail.gmail.com
[2] https://github.com/danolivo/pg_pathcheck

--
regards, Andrei Lepikhov,
pgEdge