Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Should we fix postmaster to avoid slow shutdown?
Date: 2016-11-22 20:34:13
Message-ID: 20161122203413.qbad4jrcgevkzdnk@alvherre.pgsql
Lists: pgsql-hackers

Robert Haas wrote:
> On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
> >> > Yes, I am, and I disagree with you. The current decision on this point
> >> > was made ages ago, before autovacuum even existed let alone relied on
> >> > the stats for proper functioning. The tradeoff you're saying you're
> >> > okay with is "we'll shut down a few seconds faster, but you're going
> >> > to have table bloat problems later because autovacuum won't know it
> >> > needs to do anything". I wonder how many of the complaints we get
> >> > about table bloat are a consequence of people not realizing that
> >> > "pg_ctl stop -m immediate" is going to cost them.
> >>
> >> That would be useful information to have, but I bet the answer is "not
> >> that many". Most people don't shut down their database very often;
> >> they're looking for continuous uptime. It looks to me like autovacuum
> >> activity causes the statistics files to get refreshed at least once
> >> per autovacuum_naptime, which defaults to once a minute, so on the
> >> average we're talking about the loss of perhaps 30 seconds worth of
> >> statistics.
> >
> > I think you're misunderstanding how this works. Losing that file
> > doesn't lose just the final 30 seconds worth of data -- it loses
> > *everything*, and every counter goes back to zero. So it's not a few
> > parts-per-million, it loses however many millions there were.
>
> OK, that's possible, but I'm not sure. I think there are two separate
> issues here. One is whether we should nuke the stats file on
> recovery, and the other is whether we should force a final write of
> the stats file before agreeing to an immediate shutdown. It seems to
> me that the first one affects whether all of the counters go to zero,
> and the second affects whether we lose a small amount of data from
> just prior to the shutdown. Right now, we are doing the first, so the
> second is a waste. If we decide to stop doing the first, we can
> independently decide whether to also do the second.

Well, the problem is that the stats data is not on disk while the system
is in operation, as far as I recall -- it's only in the collector's
local memory. On shutdown we tell it to write it out to a file, and on
startup we tell it to read it from the file and then delete it. I think
the rationale for this is to avoid leaving a file with stale data on
disk while the system is running.
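Roughly, the pattern is along these lines -- names and paths invented
just to illustrate, this is not the actual pgstat code:

    /* Illustration only: hypothetical names, not the real pgstat functions. */
    #include <stdio.h>

    #define STATS_FILE "pg_stat/global.stat"    /* hypothetical path */

    /* Shutdown: the collector dumps its in-memory counters to disk. */
    static void
    stats_write_file(const void *counters, size_t len)
    {
        FILE *f = fopen(STATS_FILE, "wb");

        if (f == NULL)
            return;             /* best effort -- stats are not critical data */
        fwrite(counters, 1, len, f);
        fclose(f);
    }

    /* Startup: read the file back, then delete it so that no stale copy
     * lingers on disk while the system is running. */
    static size_t
    stats_read_file(void *counters, size_t len)
    {
        FILE   *f = fopen(STATS_FILE, "rb");
        size_t  nread = 0;

        if (f != NULL)
        {
            nread = fread(counters, 1, len, f);
            fclose(f);
        }
        remove(STATS_FILE);     /* the in-memory copy becomes the only copy */
        return nread;
    }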

> > Those writes are slow because of the concurrent activity. If all
> > backends just throw their hands in the air, no more writes come from
> > them, so the OS is going to finish the writes pretty quickly (or at
> > least empty enough of the caches so that the pgstat data fits); so
> > neither (1) nor (3) should be terribly serious. I agree that (2) is a
> > problem, but it's not a problem for everyone.
>
> If the operating system buffer cache doesn't contain much dirty data,
> then I agree. But if there is a large backlog of dirty data there, then
> it might be quite slow.

That's true, but if the system isn't crashing, then the OS flushing a
bunch of those dirty pages makes room for the pgstat data to enter its
cache, which is all we need (we request only a write, not a flush, as I
recall). So we shouldn't have to wait very long.
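The distinction matters: a plain write() only hands the data to the OS
page cache and returns; it's fsync() that would make us wait for the
disk. Something like this simplified sketch (not the real code):

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: write the stats data without forcing it to stable storage. */
    static void
    write_stats_no_flush(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return;
        (void) write(fd, buf, len);   /* lands in the OS cache; usually fast */
        /* deliberately no fsync(fd): we never wait for the physical write */
        close(fd);
    }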

> > A fast shutdown is not all that fast -- it needs to write the whole
> > contents of shared buffers down to disk, which may be enormous.
> > Millions of times bigger than pgstat data. So a fast shutdown is
> > actually very slow in a large machine. An immediate shutdown, even if
> > it writes pgstat data, is still going to be much smaller in terms of
> > what is written.
>
> I agree. However, in many cases, the major cost of a fast shutdown is
> getting the dirty data already in the operating system buffers down to
> disk, not in writing out shared_buffers itself. The latter is
> probably a single-digit number of gigabytes, or maybe double-digit.
> The former might be a lot more, and the write of the pgstat file may
> back up behind it. I've seen cases where an 8kB buffered write from
> Postgres takes tens of seconds to complete because the OS buffer cache
> is already saturated with dirty data, and the stats files could easily
> be a lot more than that.

In the default Linux config, background flushing kicks in when 10% of
memory is dirty (dirty_background_ratio), and foreground flushing is
forced only when 40% of memory is dirty (dirty_ratio). That means the
pgstat process can dirty an additional 30% of memory before it is forced
to flush synchronously.
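For what it's worth, those thresholds are easy to check on a given box --
"sysctl vm.dirty_background_ratio vm.dirty_ratio", or a trivial
Linux-only snippet along these lines (illustrative only):

    #include <stdio.h>

    /* Read one of the vm.* dirty thresholds from /proc; -1 on error. */
    static int
    read_vm_ratio(const char *name)
    {
        char  path[128];
        FILE *f;
        int   val = -1;

        snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
        f = fopen(path, "r");
        if (f != NULL)
        {
            if (fscanf(f, "%d", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int
    main(void)
    {
        printf("dirty_background_ratio = %d%%\n",
               read_vm_ratio("dirty_background_ratio"));
        printf("dirty_ratio            = %d%%\n",
               read_vm_ratio("dirty_ratio"));
        return 0;
    }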

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
