Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Date: 2019-12-18 03:25:26
Message-ID: 20191218.122526.233778911708036076.horikyota.ntt@gmail.com
Lists: pgsql-bugs

At Tue, 17 Dec 2019 18:50:26 -0500, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
> > 16.12.2019 23:17, Tom Lane wrote:
> >> I've pushed this with adjustment of the other loop and some fooling
> >> with the comment --- I still don't see a need to cite stackoverflow
> >> as an authority.
>
> > Thank you! I hope now Windows machines will concede the first place of
> > the race for failures.
>
> Seems like we're not there yet :-(. Buildfarm members jacana and
> fairywren have been failing the recovery tests, in v10 and up,
> since this went in. It looks like we're getting a timeout on
> this step in 010_logical_decoding_timelines.pl:
>
> # Should be able to read from slot created before base backup
> ($ret, $stdout, $stderr) = $node_replica->psql(
> 'postgres',
> "SELECT data FROM pg_logical_slot_peek_changes('before_basebackup', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');",
> timeout => 30);
>
> Now, it's mighty suspicious that this has a timeout of 30 sec
> which is exactly how long we made pgwin32_open retry for ---
> but how is it that this test is causing an ERROR_ACCESS_DENIED
> failure? And if it was, how did we get past that before?
>
> It does look suspiciously like this wait is triggering on these
> machines and somehow getting masked in most other test cases.
> If you compare the per-step runtimes in jacana's last successful
> run,
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-12-16%2013%3A01%3A26
>
> with those after this patch went in,
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-12-17%2004%3A16%3A40
>
> there's clearly something very wrong. fairywren likewise.

The immediate cause of the failure seems to be a slowdown of startup
or recovery. The run took only 8 minutes in the successful case, but
25 minutes in the failure case even though it was aborted partway
through.

In the failure case, it took 30 seconds to advance only 26.1MB, from
15E8C48 to 30013A0, versus just around 0.3 seconds for a move of the
same size (15E9C48->3000DE8, 26.1MB) in the successful case (*1). In
the successful case, the time from archiving 00000002.history to
removing old segments is 0.15 s, but about 3.5 s in the failure case
(*2).

At first glance the immediate cause of the timeout looks like just an
extreme slowdown of file reads, but it took 2 seconds to advance even
within a single page, from 3000C18 to 3000C50. So I suspect that
6d7547c219 might be slowing down readTimeLineHistory() significantly
in this case.

regards.

*1: jacana - WAL reading
Success
2019-12-16 09:01:01.805 EST [5df78e1d.b74:4] 010_logical_decoding_timelines.pl LOG: starting logical decoding for slot "before_basebackup"
2019-12-16 09:01:01.805 EST [5df78e1d.b74:5] 010_logical_decoding_timelines.pl DETAIL: Streaming transactions committing after 0/15CC960, reading WAL from 0/15CC928.
...
2019-12-16 09:01:01.836 EST [5df78e1d.b74:34] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn: proposed 0/3000DE8, after 0/3000DE8, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960

Fail
2019-12-17 01:35:02.235 EST [5df87716.2ddc:4] 010_logical_decoding_timelines.pl LOG: starting logical decoding for slot "before_basebackup"
2019-12-17 01:35:02.235 EST [5df87716.2ddc:5] 010_logical_decoding_timelines.pl DETAIL: Streaming transactions committing after 0/15CC960, reading WAL from 0/15CC928.
...
2019-12-17 01:35:24.195 EST [5df87716.2ddc:33] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn: proposed 0/3000C18, after 0/3000C18, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960
2019-12-17 01:35:24.195 EST [5df87716.2ddc:34] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-3000C50.snap
2019-12-17 01:35:26.379 EST [5df87716.2ddc:35] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn: proposed 0/3000C50, after 0/3000C50, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960
2019-12-17 01:35:26.379 EST [5df87716.2ddc:36] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-3000D68.snap
2019-12-17 01:35:28.575 EST [5df87716.2ddc:37] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn: proposed 0/3000D68, after 0/3000D68, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960
2019-12-17 01:35:28.575 EST [5df87716.2ddc:38] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-30012F0.snap
2019-12-17 01:35:30.775 EST [5df87716.2ddc:39] 010_logical_decoding_timelines.pl DEBUG: failed to increase restart lsn: proposed 0/30012F0, after 0/30012F0, current candidate 0/15E9C48, current after 0/15E9C48, flushed up to 0/15CC960
2019-12-17 01:35:30.775 EST [5df87716.2ddc:40] 010_logical_decoding_timelines.pl DEBUG: serializing snapshot to pg_logical/snapshots/0-30013A0.snap

====
*2: jacana - checkpoint
Success
2019-12-16 09:01:01.399 EST [5df78e1d.170c:1] DEBUG: archived write-ahead log file "00000002.history"
2019-12-16 09:01:01.430 EST [5df78e1d.170c:2] DEBUG: archived write-ahead log file "000000010000000000000003.partial"
2019-12-16 09:01:01.445 EST [5df78e1b.2f60:4] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000

Fail
2019-12-17 01:35:01.950 EST [5df87715.2e0c:1] DEBUG: archived write-ahead log file "00000002.history"
2019-12-17 01:35:03.080 EST [5df87715.2e0c:2] DEBUG: archived write-ahead log file "000000010000000000000003.partial"
2019-12-17 01:35:05.389 EST [5df876ed.3370:4] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000000

--
Kyotaro Horiguchi
NTT Open Source Software Center
