Re: BUG #19416: Backend SIGSEGV in ExecShutdownHashJoin/ExecHashTableDetach/dsa_free

From: Chris Hofstaedtler <ch+pg(at)zeha(dot)at>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #19416: Backend SIGSEGV in ExecShutdownHashJoin/ExecHashTableDetach/dsa_free
Date: 2026-04-07 08:53:36
Message-ID: adTAq--RZYaHjxP9@zeha.at
Lists: pgsql-bugs

* Thomas Munro <thomas(dot)munro(at)gmail(dot)com> [260328 08:03]:
>On Thu, Feb 26, 2026 at 11:28 AM PG Bug reporting form
><noreply(at)postgresql(dot)org> wrote:
>> #0 0x0000564830c3c9d1 in dsa_free (area=0x5648621a9280, dp=<optimized out>)
>> at ./build/../src/backend/utils/mmgr/dsa.c:845
>> 845 ./build/../src/backend/utils/mmgr/dsa.c: No such file or directory.
>> (gdb) bt full
>> #0 0x0000564830c3c9d1 in dsa_free (area=0x5648621a9280, dp=<optimized out>)
>> at ./build/../src/backend/utils/mmgr/dsa.c:845
>> segment_map = 0x5648621a92b8
>> pageno = 0
>> span_pointer = 0
>> span = <optimized out>
>> superblock = <optimized out>
>> object = <optimized out>
>> size = <optimized out>
>> size_class = <optimized out>
>> #1 0x0000564830917ae1 in ExecHashTableDetach (hashtable=0x564862147fa0) at
>> ./build/../src/backend/executor/nodeHash.c:3439
>> i = <optimized out>
>> pstate = 0x7ef9970b5040
>
>Hmm, it's not supposed to be possible for two backends to reach the
>dsa_free() at nodeHash.c:3439. But it does smell a bit like that
>that's what happened... hmmm. We were in _RUN phase, and then this
>barrier should only let one process through to _FREE. I'd be
>interested in the contents of pstate->build_barrier in frame 1 if you
>get another core file with the same stack in it.

Sorry for the lack of followup. We currently believe our problem
matches the bug described in https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
("[BUG + PATCH] DSA pagemap out-of-bounds in make_new_segment
odd-sized path").

>Can you share the
>outline of the query plan? Does it happen to have a full/right PHJ
>in it?

It appears we forgot to capture more of the logs before they expired.
Here is a plan from a smaller machine carrying a similar dataset -
unfortunately I cannot guarantee it is the same plan, although it looks close.
Indeed it has a "Parallel Hash Right Join" node.

Nested Loop  (cost=13030498.63..46923224.94 rows=21144 width=36)
  ->  Nested Loop  (cost=13030498.19..46911677.41 rows=21144 width=44)
        ->  GroupAggregate  (cost=13030497.75..46808339.62 rows=54669 width=2120)
              Group Key: subscriber.id
              Filter: (last_agg(nsr.voicemail_present ORDER BY nsr.sort) = 'true'::text)
              ->  Gather Merge  (cost=13030497.75..19559670.85 rows=53993659 width=23)
                    Workers Planned: 5
                    ->  Sort  (cost=13029497.68..13056494.51 rows=10798732 width=23)
                          Sort Key: subscriber.id, nsr.sort
                          ->  Parallel Hash Left Join  (cost=10633766.78..11767970.44 rows=10798732 width=23)
                                Hash Cond: (subscriber.id = subscription.subscriber_id)
                                Filter: ((nsr.postpaid IS NULL) OR (nsr.postpaid = (subscription.id IS NOT NULL)))
                                ->  Hash Left Join  (cost=10585690.41..11619799.44 rows=11184278 width=24)
                                      Hash Cond: (subscriber.id = subscriber_options.subscriber_id)
                                      Filter: ((nsr.option_ids IS NULL) OR (nsr.option_ids <@ subscriber_options.option_ids))
                                      ->  Parallel Hash Right Join  (cost=9142123.87..10141640.84 rows=13177920 width=49)
                                            Hash Cond: ((optionvalue_lso.subscriber_id = subscriber.id) AND (optionvalue_lso.option_id = nsr.optionvalue_option_id) AND (optionvalue_lso.value = nsr.optionvalue_value))
                                            Filter: ((nsr.optionvalue_option_id IS NULL) OR (optionvalue_lso.id IS NOT NULL))
                                            ->  Parallel Seq Scan on lnk_subscriber_option optionvalue_lso  (cost=0.00..728660.20 rows=1901347 width=21)
                                                  Filter: ((activationdate <= now()) AND (deletiondate > now()))
                                            ->  Parallel Hash  (cost=8769950.27..8769950.27 rows=13177920 width=62)
                                                  ->  Parallel Hash Join  (cost=2888653.49..8769950.27 rows=13177920 width=62)
                                                        Hash Cond: (lnk_subscriber_simcard.subscriber_id = subscriber.id)
                                                        Join Filter: (((nsr.brand_id IS NULL) OR (nsr.brand_id = subscriber.brand_id)) AND ((nsr.registered IS NULL) OR (nsr.registered = subscriber.registered)) AND ((nsr.customer_category_ids IS NULL) OR (basedata.customercategory_id = ANY (nsr.customer_category_ids))) AND ((nsr.subscriberstatus_ids IS NULL) OR (lnk_subscriber_status.subscriberstatus_id = ANY (nsr.subscriberstatus_ids))))
                                                        ->  Nested Loop  (cost=1148075.17..6707980.64 rows=34348223 width=126)
                                                              Join Filter: ((simcard.imsi >= nsr.imsi_start) AND (simcard.imsi <= nsr.imsi_end))
                                                              ->  Parallel Hash Join  (cost=1148075.17..1298129.01 rows=1807801 width=12)
                                                                    Hash Cond: (lnk_subscriber_simcard.simcard_id = simcard.id)
                                                                    ->  Parallel Seq Scan on lnk_subscriber_simcard  (cost=0.00..145308.36 rows=1807801 width=8)
                                                                          Filter: ((activationdate <= now()) AND (deletiondate > now()))
                                                                    ->  Parallel Hash  (cost=1118684.52..1118684.52 rows=2351252 width=12)
                                                                          ->  Parallel Seq Scan on simcard  (cost=0.00..1118684.52 rows=2351252 width=12)
                                                              ->  Materialize  (cost=0.00..7.56 rows=171 width=138)
                                                                    ->  Seq Scan on network_settings_rule nsr  (cost=0.00..6.71 rows=171 width=138)
                                                        ->  Parallel Hash  (cost=1717950.74..1717950.74 rows=1810207 width=21)
                                                              ->  Parallel Hash Left Join  (cost=778201.62..1717950.74 rows=1810207 width=21)
                                                                    Hash Cond: (subscriber.id = lnk_subscriber_basedata.subscriber_id)
                                                                    ->  Parallel Hash Join  (cost=342951.95..1270264.89 rows=1810207 width=17)
                                                                          Hash Cond: (lnk_subscriber_status.subscriber_id = subscriber.id)
                                                                          ->  Parallel Bitmap Heap Scan on lnk_subscriber_status  (cost=102927.10..1025488.25 rows=1810207 width=8)
                                                                                Recheck Cond: (deletiondate > now())
                                                                                Filter: (activationdate <= now())
                                                                                ->  Bitmap Index Scan on idx_subscriber_status_deletiondate  (cost=0.00..100664.35 rows=9051037 width=0)
                                                                                      Index Cond: (deletiondate > now())
                                                                          ->  Parallel Hash  (cost=216653.82..216653.82 rows=1869682 width=9)
                                                                                ->  Parallel Seq Scan on subscriber  (cost=0.00..216653.82 rows=1869682 width=9)
                                                                    ->  Parallel Hash  (cost=413397.43..413397.43 rows=1748179 width=12)
                                                                          ->  Parallel Hash Left Join  (cost=228746.63..413397.43 rows=1748179 width=12)
                                                                                Hash Cond: (lnk_subscriber_basedata.basedata_id = basedata.id)
                                                                                ->  Parallel Seq Scan on lnk_subscriber_basedata  (cost=0.00..180061.84 rows=1748179 width=12)
                                                                                      Filter: ((now() >= activationdate) AND (now() < deletiondate))
                                                                                ->  Parallel Hash  (cost=201487.39..201487.39 rows=2180739 width=8)
                                                                                      ->  Parallel Seq Scan on basedata  (cost=0.00..201487.39 rows=2180739 width=8)
                                      ->  Hash  (cost=1434914.41..1434914.41 rows=692170 width=36)
                                            ->  Subquery Scan on subscriber_options  (cost=1345069.22..1434914.41 rows=692170 width=36)
                                                  ->  HashAggregate  (cost=1345069.22..1427992.71 rows=692170 width=36)
                                                        Group Key: lnk_subscriber_option.subscriber_id
                                                        Planned Partitions: 8
                                                        ->  Index Scan using idx_option_subscriber_combined on lnk_subscriber_option  (cost=0.57..1033129.48 rows=9506735 width=8)
                                                              Index Cond: ((activationdate <= now()) AND (deletiondate > now()))
                                ->  Parallel Hash  (cost=46189.54..46189.54 rows=150946 width=8)
                                      ->  Parallel Hash Join  (cost=14649.64..46189.54 rows=150946 width=8)
                                            Hash Cond: (lnk_subscription_status.subscription_id = subscription.id)
                                            ->  Hash Join  (cost=1.27..31144.93 rows=150946 width=4)
                                                  Hash Cond: (lnk_subscription_status.subscriptionstatus_id = subscriptionstatus.id)
                                                  ->  Parallel Seq Scan on lnk_subscription_status  (cost=0.00..30421.01 rows=211324 width=8)
                                                        Filter: ((activationdate <= now()) AND (deletiondate > now()))
                                                  ->  Hash  (cost=1.14..1.14 rows=10 width=4)
                                                        ->  Seq Scan on subscriptionstatus  (cost=0.00..1.14 rows=10 width=4)
                                                              Filter: activestate
                                            ->  Parallel Hash  (cost=11245.39..11245.39 rows=272239 width=8)
                                                  ->  Parallel Seq Scan on subscription  (cost=0.00..11245.39 rows=272239 width=8)
        ->  Index Scan using idx_lnk_subscriber_msisdn_subscriber_deletiondate on lnk_subscriber_msisdn  (cost=0.44..1.87 rows=1 width=8)
              Index Cond: ((subscriber_id = subscriber.id) AND (deletiondate > now()))
              Filter: (activationdate <= now())
  ->  Index Scan using pkey_msisdn on msisdn  (cost=0.43..0.53 rows=1 width=20)
        Index Cond: (id = lnk_subscriber_msisdn.msisdn_id)

For now our plan is to wait for the next 18.x stable release, then re-enable
enable_parallel_hash and see whether the crash occurs again.

As a side note, my colleagues noticed that the problematic query has
drastically better timings with enable_parallel_hash = off. That is on
our list for future exploration.
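
For the record, that comparison was done with a plain session-level
toggle along these lines (a sketch; the actual query is the one behind
the plan above):

```sql
-- Session-local only; does not touch postgresql.conf.
SET enable_parallel_hash = off;
EXPLAIN (ANALYZE, BUFFERS) /* problematic query here */;
RESET enable_parallel_hash;
```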

Best,
Chris
