Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-12 02:15:18
Message-ID: 20160212021518.GC268872@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 10, 2016 at 10:55:10AM -0500, Tom Lane wrote:
> Interestingly, we seem to have managed to greatly reduce the "other"
> time (which I presume is basically mdpostchkpt unlinking) since 9.2.
> The worst case observed in HEAD is about 100s:
>
> regression=# select ts,write,sync,total-write-sync as other, total from chkpts order by total-write-sync desc limit 10;
> ts | write | sync | other | total
> ----------------------------+---------+---------+---------+---------
> 2016-02-09 08:27:04.365-08 | 14.479 | 298.043 | 100.586 | 413.108
> 2016-02-09 06:15:22.477-08 | 75.099 | 26.590 | 90.452 | 192.141
> 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 | 51.541 | 492.249
> 2016-02-09 15:25:01.586-08 | 0.015 | 2.985 | 47.822 | 50.822
> 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 | 44.059 | 645.604
> 2016-02-09 08:56:51.339-08 | 135.464 | 31.242 | 39.131 | 205.837
> 2016-02-09 20:23:52.797-08 | 1.309 | 12.155 | 36.350 | 49.814
> 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 | 34.110 | 272.094
> 2016-02-09 19:28:33.154-08 | 0.575 | 20.434 | 33.803 | 54.812
> 2016-02-09 03:47:02.57-08 | 0.066 | 37.475 | 33.119 | 70.660
> (10 rows)
>
> but 9.2's worst cases greatly exceed that:
>
> regression=# select ts,write,sync,total-write-sync as other, total from chkpts92 order by total-write-sync desc limit 10;
> ts | write | sync | other | total
> ----------------------------+--------+---------+---------+---------
> 2016-02-09 03:27:12.904-08 | 48.761 | 124.453 | 299.877 | 473.091
> 2016-02-09 16:31:25.657-08 | 3.860 | 184.042 | 248.488 | 436.390
> 2016-02-08 23:33:36.16-08 | 66.904 | 397.653 | 229.020 | 693.577
> 2016-02-09 11:39:57.061-08 | 2.471 | 303.924 | 201.923 | 508.318
> 2016-02-09 07:37:44.48-08 | 3.317 | 1.494 | 158.159 | 162.970
> 2016-02-09 04:32:49.668-08 | 11.328 | 292.310 | 148.088 | 451.726
> 2016-02-09 12:37:43.52-08 | 17.620 | 333.983 | 141.172 | 492.775
> 2016-02-09 08:27:04.407-08 | 12.019 | 300.510 | 102.855 | 415.384
> 2016-02-09 18:17:32.193-08 | 90.370 | 25.369 | 63.626 | 179.365
> 2016-02-09 11:02:14.977-08 | 2.713 | 2.306 | 38.581 | 43.600
> (10 rows)
>
> Not real sure where that improvement came from. We could hardly
> be unlinking fewer files overall, and the bottleneck seems to be
> in the kernel, so what changed? The *average*, as opposed to worst
> case, checkpoint time has definitely grown since 9.2:

Good question; I don't know. I would have expected 9.2's xlog-driven
checkpoints to create more stability than HEAD's time-driven checkpoints.

On the AIX animals, I have now set PGCTLTIMEOUT=900 and removed the "-t 120"
arguments.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2016-02-12 02:30:37 Re: GinPageIs* don't actually return a boolean
Previous Message Kouhei Kaigai 2016-02-12 02:05:22 Re: Way to check whether a particular block is on the shared_buffer?