Re: [RFC] Should we fix postmaster to avoid slow shutdown?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [RFC] Should we fix postmaster to avoid slow shutdown?
Date: 2016-11-22 20:49:27
Message-ID: CA+Tgmobn-PgGu7OZEoeHNHuVg7ZyaJgROP=CAiPn575+LeC7ag@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 22, 2016 at 3:34 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
>> OK, that's possible, but I'm not sure. I think there are two separate
>> issues here. One is whether we should nuke the stats file on
>> recovery, and the other is whether we should force a final write of
>> the stats file before agreeing to an immediate shutdown. It seems to
>> me that the first one affects whether all of the counters go to zero,
>> and the second affects whether we lose a small amount of data from
>> just prior to the shutdown. Right now, we are doing the first, so the
>> second is a waste. If we decide to start doing the first, we can
>> independently decide whether to also do the second.
>
> Well, the problem is that the stats data is not on disk while the system
> is in operation, as far as I recall -- it's only in the collector's
> local memory. On shutdown we tell it to write it down to a file, and on
> startup we tell it to read it from the file and then delete it. I think
> the rationale for this is to avoid leaving a file with stale data on
> disk while the system is running.

/me tests.

I think you are almost right. When the server is running, there are
files in pg_stat_tmp but not pg_stat; when it is shut down, there are
files in pg_stat but not pg_stat_tmp. Of course the data can never be
ONLY in the collector's backend-local memory because then nobody else
could read it.

I don't really understand the reason for this distinction. If it's
important not to lose the data, then the current system is the worst of
all possible worlds, and it would be better to have only one file and
atomically rename() a new one into place over top of it from time
to time. If we did that and also committed the proposed patch, it
would be only slightly worse than if we did only that. Wouldn't it?
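
As a rough sketch of what I mean by the rename() approach (not the
actual pgstat code, just the general write-temp-then-rename pattern;
the function and file names here are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the new stats to a temp file, then rename() it over the old one. */
static int
write_stats_atomically(const char *path, const char *data, size_t len)
{
    char    tmppath[1024];
    int     fd;

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    close(fd);

    /*
     * rename() is atomic on POSIX filesystems, so readers always see
     * either the complete old file or the complete new one, never a
     * partial write.
     */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}

With a single file maintained that way, an immediate shutdown would at
worst lose the activity since the last rename, not everything.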

>> > Those writes are slow because of the concurrent activity. If all
>> > backends just throw their hands in the air, no more writes come from
>> > them, so the OS is going to finish the writes pretty quickly (or at
>> > least empty enough of the caches so that the pgstat data fits); so
>> > neither (1) nor (3) should be terribly serious. I agree that (2) is a
>> > problem, but it's not a problem for everyone.
>>
>> If the operating system buffer cache doesn't contain much dirty data,
>> then I agree. But if there is a large backlog of dirty data there, then
>> it might be quite slow.
>
> That's true, but if the system isn't crashing, then writing a bunch of
> pages would make room for the pgstat data to be written to the OS, which
> is enough (we request only a write, not a flush, as I recall). So we
> don't need to wait for a very long period.

I'm not sure what you are saying here. Of course, if the OS writes
pages to disk then there will be room in the buffer cache for more
dirty pages. The issue is whether this will unduly delay shutdown.
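
For anyone following along, the write-versus-flush distinction being
discussed is roughly this (just a toy example, nothing Postgres-specific):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    const char *buf = "example pgstat payload\n";
    int         fd = open("stats.demo", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return 1;

    /* Usually fast: the data just lands in the OS buffer cache. */
    if (write(fd, buf, strlen(buf)) < 0)
        perror("write");

    /*
     * Potentially slow: wait until the data is durably on disk.  This is
     * the step that's skipped if only a write is requested -- though
     * write() itself can still stall once the kernel's dirty-memory
     * thresholds are exceeded.
     */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}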

>> > A fast shutdown is not all that fast -- it needs to write the whole
>> > contents of shared buffers down to disk, which may be enormous.
>> > Millions of times bigger than pgstat data. So a fast shutdown is
>> > actually very slow in a large machine. An immediate shutdown, even if
>> > it writes pgstat data, is still going to be much smaller in terms of
>> > what is written.
>>
>> I agree. However, in many cases, the major cost of a fast shutdown is
>> getting the dirty data already in the operating system buffers down to
>> disk, not in writing out shared_buffers itself. The latter is
>> probably a single-digit number of gigabytes, or maybe double-digit.
>> The former might be a lot more, and the write of the pgstat file may
>> back up behind it. I've seen cases where an 8kB buffered write from
>> Postgres takes tens of seconds to complete because the OS buffer cache
>> is already saturated with dirty data, and the stats files could easily
>> be a lot more than that.
>
> In the default config, background flushing is invoked when memory is 10%
> dirty (dirty_background_ratio); foreground flushing is forced when
> memory is 40% dirty (dirty_ratio). That means the pgstat process can
> dirty 30% additional memory before being forced to perform flushing.

There's absolutely no guarantee that we aren't already hard up against
that 40% threshold - or whatever it is. Backends write data
all the time, and flushes only happen at checkpoint time. Indeed, I'd
argue that if somebody is performing an immediate shutdown, the most
likely reason is that a fast shutdown is too slow. And the most
likely reason for that is that the operating system buffer cache is
full of dirty data.
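
For reference, the thresholds quoted above are the Linux
vm.dirty_background_ratio and vm.dirty_ratio sysctls; something like
this (Linux-only, and the defaults vary by kernel and distro) shows
what a given box is actually configured with:

#include <stdio.h>

/* Print one integer sysctl from /proc, if readable. */
static void
print_sysctl(const char *path)
{
    FILE   *f = fopen(path, "r");
    int     value;

    if (f != NULL && fscanf(f, "%d", &value) == 1)
        printf("%s = %d\n", path, value);
    if (f != NULL)
        fclose(f);
}

int
main(void)
{
    /* Background writeback starts at this percentage of dirtyable memory. */
    print_sysctl("/proc/sys/vm/dirty_background_ratio");
    /* Writers are forced into synchronous flushing at this percentage. */
    print_sysctl("/proc/sys/vm/dirty_ratio");
    return 0;
}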

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
