Re: Apparent race condition in standby promotion

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Apparent race condition in standby promotion
Date: 2016-05-04 06:10:42
Message-ID: 20160504061042.GC2409136@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 25, 2016 at 02:09:27PM -0400, Tom Lane wrote:
> I'm looking at buildfarm member tern's recent failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-04-25%2001%3A08%3A08

> # Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/data_standby_fTzy/pgdata -l /home/nm/farm/gcc32/HEAD/pgsql.build/src/bin/pg_rewind/tmp_check/log/001_basic_standby.log promote
> server promoting
> #
> Timed out while waiting for promotion of standby at RewindTest.pm line 166.
>
> But look at what's in the standby's log:
>
> LOG: database system was interrupted; last known up at 2016-04-25 01:58:39 UTC
> LOG: entering standby mode
> LOG: redo starts at 0/2000028
> LOG: consistent recovery state reached at 0/2000C00
> LOG: database system is ready to accept read only connections
> LOG: started streaming WAL from primary at 0/3000000 on timeline 1
> LOG: statement: SELECT NOT pg_is_in_recovery()
> ... 88 occurrences of that removed ...
> LOG: statement: SELECT NOT pg_is_in_recovery()
> LOG: received promote request
> FATAL: terminating walreceiver process due to administrator command
>
> The standby server didn't notice the promote request until AFTER
> the perl script had probed 90 times, once a second, for promotion
> to finish. What's up with that?

This resembles symptoms I studied last September.

> The other theory is that the startup process received the SIGUSR2
> but failed to act on it for a long time. It checks for that only
> in CheckForStandbyTrigger(), and the calls to that function are
> in assorted rather unprincipled places, which makes it hard to
> convince oneself that such a call will always happen soon.

Right. In every case, I caught the startup process taking too long in a
single XLOG_DBASE_CREATE replay. I don't suspect a bug; that is a costly
record to replay, especially given the machine's slow filesystem metadata
operations. I could reproduce such symptoms reliably, on any system, with the
attached demonstration patch. While we could hack things until the message
prints earlier, that won't make promotion finish earlier.

> We've seen previous examples of the same thing, eg
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-02-09%2019%3A16%3A09
> Noah raised the test's timeout from 30 seconds to 90 in response
> to that failure, but it looks to me like this was a mis-diagnosis
> of the problem.

I can understand you guessing the 2016-02-09 failure inspired that change, but
it was either a coincidence or merely reminded me to commit an already-written
change. I had performed burn-in runs, with timeouts raised to 900s, totalling
25177 executions of src/bin/pg_rewind/t/002_databases.pl. The longest promote
duration among those runs was 43s, so I doubled that quantity and rounded up.

For the 2016-04-25 failure, note this bit of 001_basic_master.log:

LOG: statement: CHECKPOINT
LOG: statement: SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';
... 39 duplicate lines removed ...
LOG: statement: SELECT pg_current_xlog_location() = write_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';
LOG: received immediate shutdown request

That's 41s just to copy WAL. The slowest of those 002_databases.pl runs used
7s, and the worst I've seen in 2579 runs of 001_basic.pl is 22s. The failed
run likely coincided with exceptional system load.

If this happens again anytime soon, I should triple (or so) that 90s timeout.

Attachment Content-Type Size
repro-slow-promote-v1.patch text/plain 750 bytes

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David G. Johnston 2016-05-04 06:11:34 Re: Make PG's "NOT NULL"s and attnotnull ("is_nullable") conform to SQL-2011
Previous Message Noah Misch 2016-05-04 06:03:56 Re: text search changes vs. binary upgrade