Intermittent pg_ctl failures on Windows

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Intermittent pg_ctl failures on Windows
Date: 2018-03-10 22:48:28
Message-ID: 16922.1520722108@sss.pgh.pa.us
Lists: pgsql-hackers

The buildfarm's Windows members occasionally show weird pg_ctl failures,
for instance this recent case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2018-03-10%2020%3A30%3A20

### Restarting node "master"
# Running: pg_ctl -D G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata -l G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log restart
waiting for server to shut down.... done
server stopped
waiting for server to start....The process cannot access the file because it is being used by another process.
stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out! system pg_ctl failed

or this one:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2017-12-29%2023%3A30%3A24

### Stopping node "subscriber" using mode fast
# Running: pg_ctl -D c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata -m fast stop
waiting for server to shut down....pg_ctl: could not open PID file "c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid": Permission denied
Bail out! system pg_ctl failed

I'd been writing these off as Microsoft randomness and/or antivirus
interference, but it suddenly occurred to me that there might be a
consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
for server start/stop, it is trying to read postmaster.pid more-or-less
concurrently with the postmaster writing to that file. On Unix that's not
much of a problem, but I believe that on Windows you have to specifically
open the file with sharing enabled, or you get error messages like these.
The postmaster should be enabling sharing, because port.h redirects
open/fopen to pgwin32_open/pgwin32_fopen, which enable the sharing flags.
But port.h only does that #ifndef FRONTEND, so pg_ctl, being frontend
code, is just using naked open(), which could explain these failures.
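
To illustrate what "opening with sharing enabled" means at the Win32
level, here is a minimal sketch. The helper name open_shared_readonly is
hypothetical, and this is not the pgwin32_open code, only the general
technique as I understand it: call CreateFile with all three FILE_SHARE_*
flags and wrap the handle in a C file descriptor. On non-Windows platforms
it falls back to plain open(), since no sharing mode is needed there.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <io.h>
#else
#include <unistd.h>
#endif

/*
 * open_shared_readonly: hypothetical helper for illustration only.
 *
 * Opens "path" for reading while allowing other processes to read,
 * write, or delete/rename the file concurrently.  Returns a C file
 * descriptor, or -1 on failure.
 */
static int
open_shared_readonly(const char *path)
{
#ifdef _WIN32
	HANDLE		h;
	int			fd;

	h = CreateFileA(path,
					GENERIC_READ,
					/* permit concurrent readers, writers, and deleters */
					FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
					NULL,			/* default security attributes */
					OPEN_EXISTING,	/* fail if the file does not exist */
					FILE_ATTRIBUTE_NORMAL,
					NULL);
	if (h == INVALID_HANDLE_VALUE)
		return -1;

	/* Wrap the Win32 handle so that read()/close() work on it. */
	fd = _open_osfhandle((intptr_t) h, _O_RDONLY);
	if (fd < 0)
		CloseHandle(h);
	return fd;
#else
	/* Unix has no mandatory sharing modes; plain open() suffices. */
	return open(path, O_RDONLY);
#endif
}
```

A frontend program reading postmaster.pid would call this in place of
open(path, O_RDONLY) and then use the descriptor as usual.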

If this theory is accurate, it should be pretty easy to replicate the
problem if you modify the postmaster to hold postmaster.pid open longer
when rewriting it, e.g. stick fractional-second sleeps into CreateLockFile
and AddToDataDirLockFile.
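
As a rough sketch of the "hold the file open longer" idea, outside the
postmaster: the standalone function below (rewrite_pid_file_slowly and
sleep_ms are hypothetical names, not anything in the tree) rewrites a file
and then sleeps before closing it. It is only a stand-in for the suggested
sleeps in CreateLockFile and AddToDataDirLockFile, meant to widen the
window in which a concurrent reader might collide with the writer.

```c
#ifndef _WIN32
#define _POSIX_C_SOURCE 200809L		/* for nanosleep() */
#endif

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <time.h>
#endif

/* Sleep for the given number of milliseconds (hypothetical helper). */
static void
sleep_ms(long ms)
{
#ifdef _WIN32
	Sleep((DWORD) ms);
#else
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	nanosleep(&ts, NULL);
#endif
}

/*
 * Rewrite "path" with the given PID, keeping the file open for hold_ms
 * milliseconds after the data is written.  Returns 0 on success, -1 on
 * failure.  Illustration only; not the postmaster's lock file code.
 */
static int
rewrite_pid_file_slowly(const char *path, long pid, long hold_ms)
{
	FILE	   *f = fopen(path, "w");

	if (f == NULL)
		return -1;
	if (fprintf(f, "%ld\n", pid) < 0)
	{
		fclose(f);
		return -1;
	}
	fflush(f);

	/* The artificial delay: the file stays open while we sleep. */
	sleep_ms(hold_ms);

	return (fclose(f) == 0) ? 0 : -1;
}
```

Running this in a loop in one process while another repeatedly opens the
same file for reading would be one way to probe the theory on Windows;
whether the reader actually fails would depend on the sharing flags each
side ends up using.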

I'm not in a position to investigate this in detail nor test a fix,
but I think somebody should.

regards, tom lane
