| From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
|---|---|
| To: | Rahila Syed <rahilasyed90(at)gmail(dot)com> |
| Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
| Subject: | Re: Segmentation fault on proc exit after dshash_find_or_insert |
| Date: | 2025-12-02 04:10:29 |
| Message-ID: | CA+HiwqG7044FAMr_MX3DkZKds643gBVQRxDgWo=E=AHLH4VYRg@mail.gmail.com |
| Lists: | pgsql-hackers |
Hi Rahila,
On Fri, Nov 21, 2025 at 8:45 PM Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
> Hi,
>
> If a process encounters a FATAL error after acquiring a dshash lock but before releasing it,
> and it is not within a transaction, it can lead to a segmentation fault.
>
> The FATAL error causes the backend to exit, triggering proc_exit() and similar functions.
> In the absence of a transaction, LWLockReleaseAll() is delayed until ProcKill. ProcKill is
> an on_shmem_exit callback, and dsm_backend_shutdown() is called before any
> on_shmem_exit callbacks are invoked.
> Consequently, if a dshash lock was acquired before the FATAL error occurred, the lock
> will only be released after dsm_backend_shutdown() detaches the DSM segment containing
> the lock, resulting in a segmentation fault.
Thanks for the report.
I am not super familiar with this code path, but this raises a broader
question for me: are there other resources residing in DSM (besides
LWLocks) that might be accessed during the shutdown sequence?
We know dshash and dsa locks (LWLocks) face this risk because ProcKill
runs as an on_shmem_exit callback, which happens after
dsm_backend_shutdown() has already detached the memory.
This patch fixes the specific case for LWLocks, but if there are any
other on_shmem_exit callbacks that attempt to touch DSM memory, they
would still trigger a similar segfault. Do we need to worry about
other cleanup routines, or is ProcKill the only consumer of DSM data
at this stage?
> Please find a reproducer attached. I have modified the test_dsm_registry module to create
> a background worker that does nothing but throw a FATAL error after acquiring the dshash lock.
> This must run in a background worker to ensure it executes outside of a transaction.
>
> To trigger the segmentation fault, apply the 0001-Reproducer* patch, run make install in the
> test_dsm_registry module, specify test_dsm_registry as shared_preload_libraries in postgresql.conf,
> and start the server.
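If I am reading the steps right, they amount to something like the following (a sketch, assuming a source build; the in-tree module path and data-directory handling are my assumptions):

```shell
# Apply the reproducer patch (keep the wildcard; the exact file name
# comes from the attachment) and build the modified test module.
git apply 0001-Reproducer*.patch
cd src/test/modules/test_dsm_registry
make install

# Preload the module so its background worker starts with the server,
# then start it; the worker FATALs while holding the dshash lock.
echo "shared_preload_libraries = 'test_dsm_registry'" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" start
```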
>
> Please find attached a fix that calls LWLockReleaseAll() early in the shmem_exit() routine. This ensures
> that the dshash lock is released before dsm_backend_shutdown() is called. It also ensures that
> callbacks invoked later in shmem_exit() cannot block on a lock the exiting process still holds.
@@ -229,6 +230,14 @@ shmem_exit(int code)
{
shmem_exit_inprogress = true;
+ /*
+ * Make sure we release any pending locks so that any callbacks called
+ * subsequently do not fail to acquire any locks. This also fixes a seg
+ * fault due to releasing a dshash lock after the dsm segment containing
+ * the lock has been detached by dsm_backend_shutdown().
+ */
+ LWLockReleaseAll();
+
/*
* Call before_shmem_exit callbacks.
*
Again, not an expert, but I am concerned about placing
LWLockReleaseAll() at the very top, before before_shmem_exit()
callbacks run.
One of those callbacks might rely on locks being held or assume the
consistency of shared memory structures protected by those locks. It
seems safer to sandwich the release between the two callback lists:
after before_shmem_exit is done, but before dsm_backend_shutdown()
runs.
If we move it there, we should also reword the comment to focus on the
resource lifespan rather than just the symptom ("seg fault"):
/*
* Release all LWLocks before we tear down DSM segments.
*
* LWLocks that reside in dynamic shared memory (e.g., dshash partition
* locks) must be released before the backing segment is detached by
* dsm_backend_shutdown(). If we wait for ProcKill (via on_shmem_exit),
* the memory will already be unmapped, leading to access violations.
*/
--
Thanks, Amit Langote