add_partial_path() may remove dominated path but still in use

From: Kohei KaiGai <kaigai(at)heterodb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: add_partial_path() may remove dominated path but still in use
Date: 2018-12-28 04:21:30
Message-ID: CAOP8fzahwpKJRTVVTqo2AE=mDTz_efVzV6Get_0=U3SO+-ha1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

I've investigated a crash report of PG-Strom for a few days, then I doubt
add_partial_path() can unexpectedly release dominated old partial path
but still referenced by other Gather node, and it leads unexpected system
crash.

Please check at the gpuscan.c:373
https://github.com/heterodb/pg-strom/blob/master/src/gpuscan.c#L373

The create_gpuscan_path() constructs a partial custom-path, then it is
added to the partial_pathlist of the baserel.
If both of old and new partial paths have no pathkeys and old-path has
larger cost, add_partial_path detaches the old path from the list, then
calls pfree() to release old_path itself.

On the other hands, the baserel may have GatherPath which references
the partial-path on its pathlist. Here is no check whether the old partial-
paths are referenced by others, or not.

To ensure my assumption, I tried to inject elog() before/after the call of
add_partial_path() and just before the pfree(old_path) in add_partial_path().

----------------------------------------------------------------
dbt3=# explain select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1994-01-01'
and l_shipdate < cast(date '1994-01-01' + interval '1 year' as date)
and l_discount between 0.08 - 0.01 and 0.08 + 0.01
and l_quantity < 24 limit 1;
INFO: GpuScan:389 GATHER(0x28f3c30), SUBPATH(0x28f3f88): {GATHERPATH
:pathtype 44 :parent_relids (b 1) :required_outer (b) :parallel_aware
false :parallel_safe false :parallel_workers 0 :rows 151810
:startup_cost 1000.00 :total_cost 341760.73 :pathkeys <> :subpath
{PATH :pathtype 18 :parent_relids (b 1) :required_outer (b)
:parallel_aware true :parallel_safe true :parallel_workers 2 :rows
63254 :startup_cost 0.00 :total_cost 325579.73 :pathkeys <>}
:single_copy false :num_workers 2}
INFO: add_partial_path:830 old_path(0x28f3f88) is removed
WARNING: could not dump unrecognized node type: 2139062143
INFO: GpuScan:401 GATHER(0x28f3c30), SUBPATH(0x28f3f88): {GATHERPATH
:pathtype 44 :parent_relids (b 1) :required_outer (b) :parallel_aware
false :parallel_safe false :parallel_workers 0 :rows 151810
:startup_cost 1000.00 :total_cost 341760.73 :pathkeys <> :subpath
{(HOGE)} :single_copy false :num_workers 2}
----------------------------------------------------------------

At the L389, GatherPath in the baresel->pathlist is healthy. Its
subpath (0x28f3f88) is
a valid T_Scan path node.
Then, gpuscan.c adds a cheaper path-node so add_partial_path()
considers the above
subpath (0x28f3f88) is dominated by the new custom-path, and release it.
So, elog() at L401 says subpath has unrecognized node type: 2139062143
== 0x7f7f7f7f
that implies the memory region was already released by pfree().

Reference counter or other mechanism to tack referenced paths may be an idea
to avoid unintentional release of path-node.
On the other hands, it seems to me the pfree() at add_path /
add_partial_path is not
a serious memory management because other objects referenced by the path-node
are not released here.
It is sufficient if we detach dominated path-node from the pathlist /
partial_pathlist.

How about your opinions?

Thanks,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai(at)heterodb(dot)com>

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-12-28 04:43:14 Re: removal of dangling temp tables
Previous Message Alvaro Herrera 2018-12-28 03:37:41 Re: Minor comment fix for pg_config_manual.h