| From: | Bryan Green <dbryan(dot)green(at)gmail(dot)com> | 
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> | 
| Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects | 
| Date: | 2025-11-03 15:25:11 | 
| Message-ID: | 3becb971-0e37-4d93-a8ab-747dacd99ce1@gmail.com | 
| Lists: | pgsql-hackers | 

On 11/3/2025 9:19 AM, Andres Freund wrote:
> Hi,
> 
> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
>> We just need to call CreateJobObject() in PostmasterMain(), configure
>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
>> Children inherit membership automatically. When the job handle closes on
>> postmaster exit, the kernel terminates all children atomically. This is
>> kernel-enforced with no polling and no race conditions.
> 
> What happens if a postmaster child exits irregularly? Is postmaster terminated
> as well?
> 
No, Job Objects are unidirectional. KILL_ON_JOB_CLOSE only acts when the
postmaster (which holds the job handle) exits. Backend crashes are
handled through PostgreSQL's existing crash recovery mechanism - the
postmaster detects the crash via WaitForMultipleObjects() and initiates
recovery as normal.

The Job Object only takes action when the last handle to the job
closes, which happens when the postmaster exits. It's loosely analogous
to signaling a Unix process group - a signal sent to the group reaches
every member, but children dying doesn't affect the parent.
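
The setup described in the quoted text (create a job, set
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, assign the current process) can be
sketched roughly as follows. This is a minimal illustration and not the
code from the patch: the function name setup_kill_on_close_job and the
job_handle variable are hypothetical, and the non-Windows branch exists
only so the snippet compiles on other platforms.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

/* Kept open for the life of the process; closing it is what triggers the kill. */
static HANDLE job_handle = NULL;
#endif

/*
 * Hypothetical helper, not taken from the patch.
 *
 * Returns 1 if the current process is now in a kill-on-close job,
 * 0 if setup failed (for example when nested jobs are not permitted),
 * and -1 on non-Windows builds.
 */
int
setup_kill_on_close_job(void)
{
#ifdef _WIN32
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
    /* NULL attributes: unnamed job whose handle children do not inherit */
    HANDLE      job = CreateJobObject(NULL, NULL);

    if (job == NULL)
        return 0;

    ZeroMemory(&limits, sizeof(limits));
    limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;

    if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                 &limits, sizeof(limits)) ||
        !AssignProcessToJobObject(job, GetCurrentProcess()))
    {
        fprintf(stderr, "job object setup failed: error %lu\n", GetLastError());
        CloseHandle(job);       /* safe: no process was assigned to it */
        return 0;
    }

    job_handle = job;           /* intentionally never closed */
    return 1;
#else
    return -1;
#endif
}
```

The sketch treats failure as non-fatal, which matches the "nested jobs
fail gracefully" behavior mentioned in the quoted text: the caller can
log the failure and carry on without orphan protection.
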
>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
>> unaffected. Crash tests with taskkill /F, debugger abort, and access
>> violations all correctly terminate children immediately with zero orphans.
>>
>> This patch does not include automated tests because the core
>> functionality (orphan prevention on crash) requires simulating process
>> termination, which is difficult to test reliably in CI.
> 
> Why is it difficult to test in CI? We do some related tests in
> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
> tests for postmaster?
>
Fair point. I was hesitant because testing the actual orphan prevention
requires killing the postmaster while backends are active, which seemed
fragile. But you're right that we already test similar scenarios.

I can add a test to 013_crash_restart.pl (or a new Windows-specific test
file) that:

1. Starts the server with an active backend
2. Kills the postmaster ungracefully (taskkill /F)
3. Verifies that the backend process terminates automatically
4. Confirms a clean restart

Would that be sufficient, or do you have other test scenarios in mind?
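
Step 3 is the part that needs an OS-level check. One way it might be
done is sketched below. The helper name backend_gone is hypothetical and
is not part of the patch or of PostgreSQL's TAP framework, where the
real test would most likely be written in Perl. The non-Windows branch
is there only so the snippet compiles elsewhere.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical test helper.
 *
 * Returns 1 if the process with the given PID has exited within
 * timeout_ms, 0 if it is still running, and -1 on error or on
 * non-Windows builds.  PID reuse is ignored here; a real test would
 * want to open the handle before killing the postmaster.
 */
int
backend_gone(unsigned long pid, unsigned long timeout_ms)
{
#ifdef _WIN32
    HANDLE      proc = OpenProcess(SYNCHRONIZE, FALSE, (DWORD) pid);
    DWORD       rc;

    if (proc == NULL)
    {
        /* ERROR_INVALID_PARAMETER means no process has that PID any more. */
        return (GetLastError() == ERROR_INVALID_PARAMETER) ? 1 : -1;
    }

    /* A process handle becomes signaled when the process terminates. */
    rc = WaitForSingleObject(proc, (DWORD) timeout_ms);
    CloseHandle(proc);

    if (rc == WAIT_OBJECT_0)
        return 1;
    if (rc == WAIT_TIMEOUT)
        return 0;
    return -1;
#else
    (void) pid;
    (void) timeout_ms;
    return -1;
#endif
}
```

Waiting on the process handle avoids polling the process list, and the
timeout keeps a failed test from hanging if a backend is in fact
orphaned.
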
> Greetings,
> 
> Andres Freund