Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Date: 2019-12-18 06:00:00
Message-ID: b2b8cc4c-b7d7-0d1b-f6ee-a02bbef5bdd0@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello,
18.12.2019 6:25, Kyotaro Horiguchi wrote:
> At Tue, 17 Dec 2019 18:50:26 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
>>
>> Now, it's mighty suspicious that this has a timeout of 30 sec
>> which is exactly how long we made pgwin32_open retry for ---
>> but how is it that this test is causing an ERROR_ACCESS_DENIED
>> failure? And if it was, how did we get past that before?
Not exactly, we set timeout for 1 sec. It seems we have this timeout
occurred many times.
>> It does look suspiciously like this wait is triggering on these
>> machines and somehow getting masked in most other test cases.
>> If you compare the per-step runtimes in jacana's last successful
>> run,
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-12-16%2013%3A01%3A26
>>
>> with those after this patch went in,
>>
>> https://buildfarm.postgresql.org/cgi-bin/shocw_log.pl?nm=jacana&dt=2019-12-17%2004%3A16%3A40
>>
>> there's clearly something very wrong. fairywren likewise.
> The immediate cause of the failure seems to be a slowdown of startup
> or recovery. It took only 8 minutes in the successful case but 25
> minutes even though stopped on the way in the failure case.
>
> In the failure case, it took 30 seconds to move only 26.1MB from
> 15E8C48 to 30013A0. just around 0.3 seconds for a move of the same
> size (15E9C48->3000DE8 (26.1MB)) in the successful case (*1). In the
> sucessful case, the time took from archiving 00000002.history to
> removing old segments is 0.15 s, but about 3.5 s in the failure case
> (*2).
>
> Seemingly the immediate cause of the timeout looks like just an
> extreme slowdown of file read, but it took 2 seconds to move within a
> page from 3000C18 to 3000C50. So I suspect that 6d7547c219 might slow
> down readTimeLineHistory() significantly in the case.
>
> regards.
>
>
Yes, some tests are really slowed down (namely, test-decoding-check,
pg_basebackup-check, ...).
ProcessMonitor shows:

"8:11:57.1428800
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","SUCCESS",...
"8:11:57.1429851
AM","postgres.exe","2640","SetRenameInformationFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot\state.tmp","SUCCESS",...
"8:11:57.1432041
AM","postgres.exe","2640","CloseFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","SUCCESS",""
"8:11:57.1432964
AM","postgres.exe","2640","CloseFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot\state","SUCCESS",""
"8:11:57.1435137
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot\state","SUCCESS",...
"8:11:57.1435709
AM","postgres.exe","2640","CloseFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot\state","SUCCESS",""
"8:11:57.1437251
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.2451100
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.3571077
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.4643527
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.5778052
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.6825679
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.7921704
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:57.9018507
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:58.0109597
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:58.1204329
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:58.2300829
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot\otherdb_slot","IS
DIRECTORY",...
"8:11:58.2302124
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.3393345
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.4484310
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.5584079
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.6683586
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.7773396
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.8858091
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:58.9955264
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:59.1050665
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:59.2137752
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:59.3234641
AM","postgres.exe","2640","CreateFile","...\t_006_logical_decoding_master_data\pgdata\pg_replslot","IS
DIRECTORY",...
"8:11:59.3261361
AM","postgres.exe","2640","QueryOpen","C:\Windows\System32\kernel.appcore.dll","FAST
IO DISALLOWED",""

https://web.archive.org/web/20150317121919/http://support.microsoft.com/en-us/kb/113996
tells that STATUS_FILE_IS_A_DIRECTORY is mapped to ERROR_ACCESS_DENIED too.
It seems that the cause of the issue is in fsync_fname_ext:
    fd = OpenTransientFile(fname, flags);

    /*
     * Some OSs don't allow us to open directories at all (Windows returns
     * EACCES), just ignore the error in that case.  If desired also
silently
     * ignoring errors about unreadable files. Log others.
     */
    if (fd < 0 && isdir && (errno == EISDIR || errno == EACCES))
        return 0;
Maybe we need to take isdir into account before calling
OpenTransientFile on Windows...
Please look at the simple patch that makes the recoverycheck test pass.

Best regards,
Alexander

Attachment Content-Type Size
fix- fsync_fname_ext.patch text/x-patch 1.3 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Roman Cervenak 2019-12-18 08:49:22 Re: Memory leak (possibly connected to postgis) leading to server crash
Previous Message Kaijiang Chen 2019-12-18 03:25:32 Fwd: weird long time query