PostgreSQL shutdown modes

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	PostgreSQL shutdown modes
Date:	2022-04-01 17:22:05
Message-ID:	CA+TgmoYxs1dzDN5jc5rVJz236M0uOd6QA2JiY+1yb=BVYg8MgA@mail.gmail.com
Lists:	pgsql-hackers

Hi,

I think it's pretty evident that the names we've chosen for the
various PostgreSQL shutdown modes are pretty terrible, and maybe we
should try to do something about that. There is nothing "smart" about
a smart shutdown. The usual result of attempting a smart shutdown is
that the server never shuts down at all, because typically there are
going to be some applications using connections that are kept open
more or less permanently. What ends up happening when you attempt a
"smart" shutdown is that you've basically put the server into a mode
where you're irreversibly committed to accepting no new connections,
but because you have a connection pooler or something that keeps
connections open forever, you never shut down either. It is in effect
a denial-of-service attack on the database you're supposed to be
administering.
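
One common way to bound that risk, sketched here under stated assumptions rather than as anything the patch does: attempt a smart shutdown with a timeout, and escalate to a fast shutdown if it does not finish. The function name `stop_with_escalation`, its parameters, and the 60-second default are hypothetical; the `pg_ctl stop` options `-D`, `-m`, `-w`, and `-t` are the documented ones, and `pg_ctl` exits nonzero when the wait times out.

```python
import subprocess


def stop_with_escalation(datadir, smart_timeout_s=60, run=subprocess.run):
    """Try a smart shutdown first; fall back to fast if it times out.

    `run` is injectable so the logic can be exercised without a server.
    Returns the name of the mode that actually stopped the server.
    """
    smart = ["pg_ctl", "stop", "-D", datadir, "-m", "smart",
             "-w", "-t", str(smart_timeout_s)]
    result = run(smart)
    if result.returncode == 0:
        return "smart"

    # The server now refuses new connections but is still up,
    # so escalate instead of waiting on long-lived sessions.
    fast = ["pg_ctl", "stop", "-D", datadir, "-m", "fast", "-w"]
    run(fast, check=True)
    return "fast"
```

Usage would look like `stop_with_escalation("/path/to/data", smart_timeout_s=30)`, where the data directory path is a placeholder.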

Similarly, "fast" shutdowns are not in any way fast. It is pretty
common for a fast shutdown to take many minutes or even tens of
minutes to complete. This doesn't require some kind of extreme
workload to hit; I've run into it during casual benchmarking runs.
It's very easy to have enough dirty data in shared buffers, or enough
dirty data in the operating system cache that will have to be fsync'd in
order to complete the shutdown checkpoint, to make things take an
extremely long time. In some ways, this is an even more effective
denial-of-service attack than a smart shutdown. True, the database
will at some point actually finish shutting down, but in the meantime
not only will we not accept new connections but we'll evict all of the
existing ones. Good luck maintaining five nines of availability if
waiting for a clean shutdown to complete is any part of the process.
It might be smarter to initiate a regular (non-shutdown) checkpoint
first, without cutting off connections, and then when that finishes,
proceed as we do now. The second checkpoint will complete a lot
faster, so while the overall operation still won't be fast, at least
we'd be refusing connections for a shorter period of time before the
system is actually shut down and you can do whatever maintenance you
need to do.
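
The checkpoint-first idea can be approximated from outside the server today. The following is a minimal sketch, not the server-side change proposed above: it issues an ordinary CHECKPOINT through psql while connections are still being served, then asks for a fast shutdown. The function names and the `dbname` default are hypothetical, and it assumes `psql` and `pg_ctl` are on the PATH and that the connecting role is allowed to run CHECKPOINT.

```python
import subprocess


def checkpoint_then_stop_commands(datadir, dbname="postgres"):
    """Return the two commands in the order they should run."""
    # Ordinary checkpoint: flushes dirty buffers without cutting off clients.
    pre_checkpoint = ["psql", "-d", dbname, "-c", "CHECKPOINT"]
    # Fast shutdown: its shutdown checkpoint should now have less to write.
    fast_stop = ["pg_ctl", "stop", "-D", datadir, "-m", "fast"]
    return [pre_checkpoint, fast_stop]


def checkpoint_then_stop(datadir, dbname="postgres", run=subprocess.run):
    """Run both steps, aborting if either command fails."""
    for cmd in checkpoint_then_stop_commands(datadir, dbname):
        run(cmd, check=True)
```

Data dirtied between the two steps still has to be written during shutdown, so this narrows the window in which connections are refused rather than eliminating it.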

"immediate" shutdowns aren't as bad as the other two, but they're
still bad. One of the big problems that I encounter in this area is
that Oracle uses the name "immediate" shutdown to mean a normal
shutdown with a checkpoint allowing for a clean restart. Users coming
from Oracle are sometimes extremely surprised to discover that an
immediate shutdown is actually a server crash that will require
recovery. Even if you don't come from Oracle, there's really nothing
about the name of this shutdown mode that intrinsically makes you
understand that it's something you should do only as a last resort.
Who doesn't like things that are immediate? The problem with this
theory is that you make the shutdown quicker at the price of startup
becoming much, much slower, because the crash recovery is very likely
going to take a whole lot longer than the shutdown checkpoint would
have done.

I attach herewith a modest patch to rename these shutdown modes to
more accurately correspond to their actual characteristics.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment	Content-Type	Size
v1-0001-Give-our-various-shutdown-types-more-appropriate-.patch	application/octet-stream	54.7 KB

Responses

Re: PostgreSQL shutdown modes at 2022-04-01 18:35:11 from Justin Pryzby
Re: PostgreSQL shutdown modes at 2022-04-02 02:58:55 from Michael Paquier
Re: PostgreSQL shutdown modes at 2022-04-02 13:39:52 from chap