Re: Slow shutdowns sometimes on RDS Postgres

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: <pgsql-general(at)lists(dot)postgresql(dot)org>
Cc: Christophe Pettus <xof(at)thebuild(dot)com>, Chris Williams <cswilliams(at)gmail(dot)com>, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Subject: Re: Slow shutdowns sometimes on RDS Postgres
Date: 2018-09-14 19:11:06
Message-ID: 4c65f988-a5ee-53f1-2d58-476a5c244cd2@amazon.com
Lists: pgsql-general

On 9/14/18 10:04, Christophe Pettus wrote:
> In our experience, it's actually quite common that an RDS shutdown (or
> even just applying parameter changes) can take a while. What's
> particularly concerning is that it's not predictable, and that can
> make it hard to schedule and manage maintenance windows. What we were
> told previously is that RDS queues the operations, and it can take a
> variable amount of time for the operation to be worked on from the
> queue. Is that not the case?

Thanks Christophe - even if it's not what Chris is running into, this
is another good call-out.

It's important to distinguish here between the RDS parts and the
community PostgreSQL parts.  I think for this thread it's just worth
pointing out that RDS automation/tooling will report the database in a
"modifying" state until it completes its management operations, but
the actual database unavailability is much shorter.  RDS carefully
engineers its processes to minimize the actual database unavailability
itself.

Chris has run into a problem where the PostgreSQL processes did not shut
down, evidenced by the error messages he mentioned, and as a result his
database was actually unavailable to applications for an extended
period.  This is uncommon and concerning.

This isn't the right forum for discussing the RDS bits; let's take that
to the AWS forums.  RDS operations aren't synchronous, but the time to complete
should absolutely be predictable within reasonable bounds depending on
the operation type. I don't know how anyone could use the platform
otherwise!  If anyone is unable to establish bounded expectations for
some automated operation, I'd strongly encourage starting a thread on
the AWS forums or opening a support ticket.

On 9/14/18 09:27, Adrian Klaver wrote:
> The thing is I do not remember any posts to this list mentioning the
> same problem on a platform outside RDS. A quick search seems to
> confirm that.

I've met folks from other large fleet operators at PG conferences.
There are all kinds of stories we don't find on the lists yet.  :)
Hopefully we're all getting better about closing the loop and sharing
stuff back - that's part of the value large fleet operators can and
should bring to the community.

>> I don't know about this specific incident, but I do know that the RDS
>> team has seen cases where a backend gets into a state (like a system
>> call) where it's not checking signals and thus doesn't receive or
>> process the postmaster's request to quit. We've seen these processes
>> delay shutdowns and also block recovery on streaming replicas.
>
> The particulars of that state?

For the cases I've heard about, we haven't yet caught things quickly
enough to get stack dumps.  So I don't think we have particulars yet.
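
To illustrate the general mechanism in the quoted text (not any specific
RDS incident), here is a minimal sketch in Python.  It assumes a worker
that only looks for a pending quit request at safe points between steps,
so a single long blocking step, such as a stuck system call, delays the
response by however long that step takes.  The function name and the
numbers are hypothetical, and this is a toy model, not PostgreSQL code.

```python
def shutdown_delay(work_units, request_at):
    """Seconds between a quit request and the moment the worker notices it.

    work_units: durations (seconds) of steps during which the simulated
                worker cannot check for pending requests.
    request_at: time (seconds from start) at which the quit request arrives.
    """
    clock = 0
    for duration in work_units:
        # Safe point between steps: the only place the request is examined.
        if clock >= request_at:
            return clock - request_at
        clock += duration
    # Work ran out; the request is noticed once the last step has ended.
    return max(clock - request_at, 0)


# Short steps: the request is noticed within one step of arriving.
print(shutdown_delay([1, 1, 1, 1], request_at=1.5))
# One long blocking step: the request waits until that step returns.
print(shutdown_delay([1, 30, 1], request_at=1.5))
```

The second call shows why one process stuck between safe points can hold
up an otherwise quick shutdown.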

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services
