From: Bryan Green <dbryan(dot)green(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH] Fix orphaned backend processes on Windows using Job Objects
Date: 2025-11-03 15:12:03
Message-ID: 880214db-ab8c-4b9e-852c-b0f6d90d3f3d@gmail.com
Lists: pgsql-hackers


Greetings,

When the postmaster exits unexpectedly on Windows (crash, kill, debugger
abort), backend processes continue running. Windows lacks any equivalent
to Unix's getppid() orphan detection. These orphaned backends hold locks
and shared memory, preventing a clean restart. The result is delayed
restarts and orphans that have to be killed by hand.

The problem is easy to reproduce. Start postgres, open a transaction
with LOCK TABLE, then kill the postmaster with taskkill /F. The backend
continues running and restart fails. Manual cleanup is required.

Current approaches (inherited event handles, shared memory flags) depend
on the postmaster running code during exit. A segfault or kill bypasses
all of that.

My proposed solution is to use Windows Job Objects with the
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE limit flag.

We just need to call CreateJobObject() in PostmasterMain(), configure the
job with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster to it.
Children inherit membership automatically. When the job handle closes on
postmaster exit, the kernel terminates all children atomically. This is
kernel-enforced with no polling and no race conditions.
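
As a rough illustration only (this is not the code from the attached
patch), the sequence might look like the sketch below. The helper name
enable_kill_on_job_close(), its signature, and the non-Windows stub are
assumptions made so the example compiles on any platform.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper (name and signature are not from the attached patch).
 * Puts the calling process into a new job that has
 * JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE set.
 *
 * Returns 1 and stores the job handle in *job_out on success, 0 otherwise.
 * The handle is passed as void * because HANDLE is a void * typedef.
 */
static int
enable_kill_on_job_close(void **job_out)
{
    assert(job_out != NULL);
    *job_out = NULL;

#ifdef _WIN32
    {
        JOBOBJECT_EXTENDED_LIMIT_INFORMATION info;
        HANDLE      job;

        /* NULL attributes: the handle is not inherited by child processes. */
        job = CreateJobObject(NULL, NULL);
        if (job == NULL)
            return 0;

        ZeroMemory(&info, sizeof(info));
        info.BasicLimitInformation.LimitFlags =
            JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;

        if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                     &info, sizeof(info)) ||
            !AssignProcessToJobObject(job, GetCurrentProcess()))
        {
            /* The job is still empty here, so closing it kills nothing. */
            CloseHandle(job);
            return 0;
        }

        *job_out = job;
        return 1;
    }
#else
    return 0;                   /* Job Objects exist only on Windows */
#endif
}
```

In a sketch like this the caller would keep the returned handle open for
the lifetime of the process and never close it explicitly: closing the
last handle is what triggers termination of every process in the job,
including the caller.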

Assigning the postmaster to the job can fail if postgres already runs
inside an existing job (service managers, debuggers). Windows 7 and
earlier disallow nested jobs. We detect this with IsProcessInJob(), and
if AssignProcessToJobObject() fails with ERROR_ACCESS_DENIED, we log and
continue without orphan protection.
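
The fallback decision can be sketched as a small pure function. This is
an illustration, not the patch's code: the enum, both function names, and
the message strings are hypothetical. The inputs stand for the results of
IsProcessInJob(), AssignProcessToJobObject(), and GetLastError().

```c
#include <assert.h>

/*
 * Numeric value of ERROR_ACCESS_DENIED from <winerror.h>, repeated here
 * so the sketch compiles on any platform.
 */
#define SKETCH_ERROR_ACCESS_DENIED 5UL

/* Hypothetical outcome codes; not taken from the attached patch. */
typedef enum
{
    ORPHAN_GUARD_ACTIVE,            /* postmaster is in the new job */
    ORPHAN_GUARD_SKIPPED_NESTED,    /* existing job refused the assignment */
    ORPHAN_GUARD_SKIPPED_OTHER      /* any other failure */
} OrphanGuardStatus;

/*
 * already_in_job: value reported by IsProcessInJob()
 * assign_ok:      nonzero if AssignProcessToJobObject() succeeded
 * last_error:     GetLastError() after a failed assignment
 */
static OrphanGuardStatus
classify_job_assignment(int already_in_job, int assign_ok,
                        unsigned long last_error)
{
    if (assign_ok)
        return ORPHAN_GUARD_ACTIVE;
    if (already_in_job && last_error == SKETCH_ERROR_ACCESS_DENIED)
        return ORPHAN_GUARD_SKIPPED_NESTED;
    return ORPHAN_GUARD_SKIPPED_OTHER;
}

/* Illustrative log text for each outcome; startup continues in all cases. */
static const char *
orphan_guard_message(OrphanGuardStatus status)
{
    switch (status)
    {
        case ORPHAN_GUARD_ACTIVE:
            return "job object active: children die with the postmaster";
        case ORPHAN_GUARD_SKIPPED_NESTED:
            return "already in a job that disallows nesting: no orphan protection";
        case ORPHAN_GUARD_SKIPPED_OTHER:
            return "could not set up job object: no orphan protection";
    }
    assert(0 && "unknown OrphanGuardStatus");
    return "";
}
```

Neither failure outcome is treated as fatal here, which matches the
log-and-continue behavior described above.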

KILL_ON_JOB_CLOSE doesn't interfere with clean shutdown. Normal shutdown
signals backends via SetEvent, they exit, postmaster exits, job closes.
Nothing left to kill. The flag only fires during crashes when backends
are still running - exactly when forced termination is correct.

The code is ~200 lines in pg_job_object.c, less than win32/signal.c
(~500 lines). It fails gracefully and works regardless of how postgres
is started, unlike service manager approaches. Because termination is
enforced by the kernel, it also avoids the unreliability of polling.

The patch has been tested on Windows 10/11 with both MSVC and MinGW
builds. Nested jobs fail gracefully as expected. Clean shutdown is
unaffected. Crash tests with taskkill /F, debugger abort, and access
violations all correctly terminate children immediately with zero orphans.

This patch does not include automated tests because the core
functionality (orphan prevention on crash) requires simulating process
termination, which is difficult to test reliably in CI.

Patch attached. I can add documentation if this approach is approved.

Thoughts?

Bryan Green

Attachment: 0001-Use-Windows-Job-Objects-to-prevent-orphaned-child-pr.patch
Content-Type: text/plain
Size: 11.3 KB