Re: pg_stop_backup does not complete

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pg_stop_backup does not complete
Date: 2010-02-23 18:58:22
Message-ID: 1266951502.3752.4294.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-02-23 at 09:45 -0800, Josh Berkus wrote:

> 1) Set up a brand new master with an archive_command and archive_mode=on.
>
> 2) Start the master
>
> 3) Do a pg_start_backup()
>
> 4) Realize, based on log error messages, that I've misconfigured the
> archive_command.

> 5) Attempt to shut down the master. Master tells me that pg_stop_backup
> must be run in order to shut down.
>
> 6) Execute pg_stop_backup.
>
> 7) pg_stop_backup waits forever without ever stopping backup. Every 60
> seconds, it gives me a helpful "still waiting" message, but at least in
> the amount of time I was willing to wait (5 minutes), it never completed.
>
> 8) do an immediate shutdown, as it's the only way I can get the database
> unstuck.
>
> With some experimentation, the problem seems to occur when you have a
> failing archive_command and a master which currently has no database
> traffic; for example, if I did some database write activity (a createdb)
> then pg_stop_backup would complete after about 60 seconds (which, btw,
> is extremely annoying, but at least tolerable).
>
> This issue is 100% reproducible.
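A simplified model of the wait in step 7 may help show why it never returns. It assumes (a simplification, not the actual backend code) that pg_stop_backup() polls the archive_status directory until a `.done` marker appears for the last WAL file the backup needs. With a failing archive_command, that marker is never written. The function name and the `timeout` parameter are hypothetical; the pg_stop_backup() discussed in this thread has no such option, which is why the only way out was an immediate shutdown.

```python
import os
import time


def wait_for_archive(status_dir, segment, poll=1.0, warn_every=60.0,
                     timeout=None, log=print):
    """Hypothetical, simplified model of the wait at the end of pg_stop_backup().

    Returns True once <segment>.done exists in status_dir. Returns False only
    if the (hypothetical) timeout expires first. If archive_command keeps
    failing, the .done file never appears, so with timeout=None this loops
    forever, logging a "still waiting" line every warn_every seconds.
    """
    done_path = os.path.join(status_dir, segment + ".done")
    start = time.monotonic()
    next_warn = warn_every
    while not os.path.exists(done_path):
        elapsed = time.monotonic() - start
        if timeout is not None and elapsed >= timeout:
            return False  # give up instead of blocking forever
        if elapsed >= next_warn:
            log("still waiting for %s to be archived (%d seconds elapsed)"
                % (segment, int(elapsed)))
            next_warn += warn_every
        time.sleep(poll)
    return True
```

In this model, an explicit escape hatch (a timeout here, or an explicit abort function like the one Kevin suggested) is the only thing that breaks the loop short of fixing archive_command.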

IMHO there is no problem in that behaviour. If somebody requests a
backup then we should wait for it to complete. Kevin's suggestion of
pg_fail_backup() is the only sensible conclusion there because it gives
an explicit way out of deadlock.

ISTM the problem is that you didn't test. Steps 3 and 4 should have been
reversed. Perhaps we should put something in the docs to say "and test".
The correct resolution is to put in an archive_command that works.
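Testing the command before step 3 could be as simple as running it once by hand against a scratch file. The sketch below assumes only the documented placeholder rules (%p is the path of the file to archive, %f its file name, %% a literal %) and the rule that archive_command must return exit status zero only on success. The helper names are hypothetical. The server passes %p relative to the data directory and runs the command as the server user, so this is a smoke test, not a guarantee.

```python
import os
import subprocess
import tempfile


def expand_archive_command(template, path, fname):
    """Substitute %p, %f and %% in an archive_command string.

    A % followed by any other character is left as-is.
    """
    out = []
    i = 0
    while i < len(template):
        c = template[i]
        if c == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":
                out.append(path)
                i += 2
                continue
            if nxt == "f":
                out.append(fname)
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        out.append(c)
        i += 1
    return "".join(out)


def archive_command_works(template, fname="000000010000000000000001"):
    """Run the command once on a dummy file; True only on exit status 0."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, fname)
        with open(path, "wb") as f:
            f.write(b"\0" * 8192)  # dummy payload, not a real WAL segment
        cmd = expand_archive_command(template, path, fname)
        result = subprocess.run(cmd, shell=True)
        return result.returncode == 0
```

For example, `archive_command_works("cp %p /mnt/archive/%f")` (with /mnt/archive as a hypothetical archive location) should return True before pg_start_backup() is attempted. Remember to delete the dummy file it leaves in the archive afterwards.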

We can put in an extra step to prevent a pg_start_backup() if there are
a significant number of outstanding files to be archived. Doing that
seems like closing the door after the horse has bolted, since we just
introduced streaming replication that doesn't rely on archived files. In
any case, I don't see many people working on a production system hitting
a problem on an archive_command and then deciding to shut down.
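As a client-side approximation, the extra step mentioned above could count the `.ready` markers the server leaves in pg_xlog/archive_status for WAL files not yet archived. The function names and the threshold are hypothetical. This is a sketch of the idea, not a proposed patch.

```python
import os


def outstanding_segments(archive_status_dir):
    """Count files still waiting for the archiver (*.ready markers)."""
    return sum(1 for name in os.listdir(archive_status_dir)
               if name.endswith(".ready"))


def check_before_start_backup(archive_status_dir, max_outstanding=3):
    """Hypothetical pre-check: refuse to start a base backup if archiving is behind."""
    n = outstanding_segments(archive_status_dir)
    if n > max_outstanding:
        raise RuntimeError(
            "%d WAL files are waiting to be archived; check archive_command "
            "before running pg_start_backup()" % n)
    return n
```

Pointed at $PGDATA/pg_xlog/archive_status, a steadily growing count is a strong hint that archive_command is failing.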

So I don't see this as something that needs fixing for 9.0. There is
already too much non-essential code there, all of which needs to be
tested. I don't think adding in new corner cases to "help" people makes
any sense until we have automated testing that allows us to rerun the
regression tests to check all this stuff still works.

--
Simon Riggs www.2ndQuadrant.com
