Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Date: 2019-12-19 20:00:00
Message-ID: 5c7eaa87-8670-2008-3a15-337c133c12c2@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

19.12.2019 18:44, Tom Lane wrote:
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
>> Please look at the patch that implements (2). It makes `vcregress
>> recoverycheck` pass (and my demo restart test still passes too).
> I think we could be a little smarter here. If the problem is
> STATUS_DELETE_PENDING, doesn't that affect stat() as well? That is,
> if stat() succeeds, we needn't wait, independently of whether it
> says S_ISDIR or not? This seems like a noticeable improvement if
> true, because it would mean that ordinary file-permission failures
> also need not wait. We'd still get confused if we get a permission
> failure on a containing directory rather than the file itself, but
> that I'm prepared to dismiss as an uncommon case.
>
> Entirely-untested patch along this line attached.
I have tested it with my demo restart test and it works. Yes, stat() is
resulted in the CreateFile() call too, so it should fail the same way. 
I haven't managed to catch both CreateFile() and stat() failing with
STATUS_DELETE_PENDING (CreateFile() failed for me 37 times per 1000
loops), but I believe that sleep() will work correctly in this case.
`vcregress recoverytest` is passed too.
The DELETE_PENDING conflict now looks like this:
1 "8:50:44.9928200
PM","postgres.exe","2792","CreateFile","...\pgdata\postmaster.pid","SUCCESS","Desired
Access: Read Attributes, Delete, ..."
2 "8:50:44.9928951
PM","postgres.exe","2792","QueryAttributeTagFile","...\pgdata\postmaster.pid","SUCCESS",...
3 "8:50:44.9929114
PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","DELETE
PENDING","Desired Access: Generic Read, Disposition: Open, ..."
4 "8:50:44.9929162
PM","postgres.exe","2792","SetDispositionInformationFile","...\pgdata\postmaster.pid","SUCCESS","Delete:
True"
5 "8:50:44.9929377
PM","postgres.exe","2792","CloseFile","...\pgdata\postmaster.pid","SUCCESS",""
6 "8:50:44.9932617
PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Read Attributes, ..."
7 "8:50:45.1042027
PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Generic Read, ..."
8 "8:50:45.9926445
PM","pg_ctl.exe","2396","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Generic Read ..."
Here 1, 2, 4, 5 are calls behind unlink() in postmaster, 3 is
CreateFile() in pgwin32_open(), 6 is stat() in pgwin32_safestat(), then
we have sleep for 0.1 sec, 7 is CreateFile() in pgwin32_open() again,
and 8 is CreateFile() performed by a new pg_ctl instance.

Maybe we should change the condition to 'if (stat(fileName, &st) != 0 &&
(err = GetLastError()) == ERROR_ACCESS_DENIED)' to avoid unnecessary
sleep with a loop iteration...
With this change the same conflict looks like this:
1 "10:23:12.9940793
PM","postgres.exe","1908","CreateFile","...\pgdata\postmaster.pid","SUCCESS","Desired
Access: Read Attributes, ..."
2 "10:23:12.9941500
PM","postgres.exe","1908","QueryAttributeTagFile","...\pgdata\postmaster.pid","SUCCESS",
...
3 "10:23:12.9941713
PM","postgres.exe","1908","SetDispositionInformationFile","...\pgdata\postmaster.pid","SUCCESS","Delete:
True"
4 "10:23:12.9941766
PM","pg_ctl.exe","2420","CreateFile","...\pgdata\postmaster.pid","DELETE
PENDING","Desired Access: Generic Read ..."
5 "10:23:12.9941936
PM","postgres.exe","1908","CloseFile","...\pgdata\postmaster.pid","SUCCESS",""
6 "10:23:12.9945254
PM","pg_ctl.exe","2420","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Read Attributes, ..."
7 "10:23:13.9940261
PM","pg_ctl.exe","1676","CreateFile","...\pgdata\postmaster.pid","NAME
NOT FOUND","Desired Access: Generic Read, ..."
Here 1, 2, 3, 5 are calls behind unlink() in postmaster, 4 is
CreateFile() in pgwin32_open(), 6 is stat() in pgwin32_safestat(), 7 is
CreateFile() performed by a new pg_ctl instance.
`vcregress recoverytest` is passed.
The log of access to the logical decoding slot now looks like this:
1 "10:37:22.9496662
PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY","Desired Access: Generic Read, ..."
2 "10:37:22.9498061
PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","Desired
Access: Read Attributes, ..."
3 "10:37:22.9498478
PM","postgres.exe","2956","QueryInformationVolume","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","VolumeCreationTime:
... "
4 "10:37:22.9498634
PM","postgres.exe","2956","QueryAllInformationFile","...\pgdata\pg_replslot\otherdb_slot","BUFFER
OVERFLOW","CreationTime: ..."
5 "10:37:22.9498860
PM","postgres.exe","2956","CloseFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS",""
6 "10:37:22.9499936
PM","postgres.exe","2956","QueryOpen","...\pgdata\pg_replslot\otherdb_slot","FAST
IO DISALLOWED",""
7 "10:37:22.9500733
PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","Desired
Access: ..."
8 "10:37:22.9500973
PM","postgres.exe","2956","QueryNetworkOpenInformationFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","CreationTime:..."
9 "10:37:22.9501108
PM","postgres.exe","2956","CloseFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS",""

> BTW ... it's likely that stat() here is actually going to invoke
> pgwin32_safestat(), which has its own opinions about this, and
> indeed seems to think we'll get ERROR_DELETE_PENDING. I think
> that's harmless here, but it makes me wonder if we should use
> a native Windows API instead of going through stat().
Yes, I would call native stat(), as we don't need to perform an
excessive call to get file size (for directories).
It seems that the check for ERROR_DELETE_PENDING was added to
pgwin32_safestat() blindly, the issue wasn't reproduced at that time:
https://www.postgresql.org/message-id/CAB7nPqRJV6trFta-Qzgi6j2feuYR2ZC%2BKHvWdHnbpDG2scTrxw%40mail.gmail.com

>> As open(dir) is getting a little more expensive than before, maybe it's
>> still worthwhile to patch fsync*(..., isdir,...). I can prepare such a
>> patch if so.
> I think we should leave that for later, so that the buildfarm can
> actually test whatever logic we put in here.
Agreed. It doesn't directly affect this bug and should be done separately.

Best regards,
Alexander

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2019-12-19 20:09:45 Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Previous Message Tom Lane 2019-12-19 15:44:08 Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)