"could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

From: Thomas Crayford <tcrayford(at)salesforce(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery
Date: 2018-09-24 11:58:59
Message-ID: CAJgZ2Z7SB1ZFpPPUh6Lczw+HdLc9t2rwmxp7OOEa7+-hJqZtEw@mail.gmail.com
Lists: pgsql-bugs

Hi there,

We've found what appears to be a race condition on a replica server that is
restoring WAL from the archive. We are not sure whether it also happens on
streaming replicas.
Specifically, we occasionally see an error like this:

May 20 09:56:14 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR:
could not open file "pg_wal/00000002000072B50000003A": No such file or
directory

Looking at the logs around this error, we can see that the recovery process
had just restored that file:

May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
file or directory

(on this particular server, pid 9 is the recovery process, and pid 141 is
the checkpointer)

We've seen this across a range of server versions, including (in the last
30 days):

9.5.12
9.5.14
9.6.6
9.6.7
9.6.8
9.6.9
9.6.10
10.2
10.3
10.4
10.5

Upon digging into the code, we found what we think is the reason behind this
race: the recovery process can call unlink() on the file while the
restartpoint attempts to recycle the same segment. To quote from the docs:

In archive recovery or standby mode, the server periodically performs
*restartpoints*, which are similar to checkpoints in normal operation: the
server forces all its state to disk, updates the pg_control file to
indicate that the already-processed WAL data need not be scanned again, and
then recycles any old log segment files in the pg_wal directory.
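
To make "recycles" concrete: as we understand it, recycling renames an old
segment file to the name of a future segment so the already-allocated disk
space is reused, rather than unlinking it and creating a fresh file later.
A rough standalone paraphrase (our own toy code with made-up names, not the
server's source; the real logic lives around InstallXLogFileSegment() in
xlog.c):

    #include <stdio.h>

    /*
     * Toy paraphrase of WAL segment recycling: rename an old,
     * no-longer-needed segment to the name a future segment will
     * use, so the existing file gets reused.
     */
    static int
    recycle_segment(const char *old_seg, const char *future_seg)
    {
        return rename(old_seg, future_seg);
    }

    int
    main(void)
    {
        /* Illustrative names only; this just demonstrates the rename. */
        if (recycle_segment("pg_wal/00000002000072B50000003A",
                            "pg_wal/00000002000072B500000099") != 0)
            perror("recycle_segment");
        return 0;
    }

The property that matters for this report is that the recycle step assumes
the old file still exists when it is operated on.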

Looking at the code, we think the two racing functions are
RestoreArchivedFile() and CreateRestartPoint(): the former calls unlink()
on the WAL segment, while the latter attempts to recycle old segments.
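
To illustrate the losing interleaving deterministically, here is a minimal
standalone sketch. This is again our own toy code, not PostgreSQL source;
in the real server the failure surfaces as the "could not open file" error
above rather than at a rename(), but the root cause is the same: the
segment vanishes between being chosen for recycling and being operated on.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char *old_seg  = "00000002000072B50000003A"; /* due for recycling */
        const char *recycled = "00000002000072B500000099"; /* hypothetical future name */

        /* Set-up: the old segment exists, and the "checkpointer" has
         * already chosen it for recycling. */
        int fd = open(old_seg, O_CREAT | O_WRONLY, 0600);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);

        /* Recovery process: RestoreArchivedFile() unlinks the stale
         * copy before restoring a fresh one from the archive. */
        if (unlink(old_seg) != 0) { perror("unlink"); return 1; }

        /* Checkpointer: now tries to recycle the segment it chose a
         * moment ago. The source file is gone, so the operation fails
         * with ENOENT ("No such file or directory"). */
        if (rename(old_seg, recycled) != 0)
            fprintf(stderr, "recycle failed: %s\n", strerror(errno));

        return 0;
    }

Run in an empty directory, this prints "recycle failed: No such file or
directory", the same errno the checkpointer reports, just from a toy call
site.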

From our perspective this looks like a straightforward race: the recovery
process unlinks the file just as the checkpointer tries to recycle that
same WAL segment. At that point the checkpointer panics and shuts down,
causing a hit to availability.

As far as we have seen, restarting the server after the crash allows it to
continue making progress.

This seems to be an extremely rare race: our data show it occurring once
every few hundred thousand WAL restores from archive across Heroku
Postgres.

Thanks

Tom Crayford
Heroku Postgres
