archive status ".ready" files may be created too early

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: archive status ".ready" files may be created too early
Date: 2019-12-12 22:50:20
Message-ID: CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi hackers,

I believe I've uncovered a bug that may cause archive status ".ready"
files to be created too early, which in turn may cause an incorrect
version of the corresponding WAL segment to be archived.

The crux of the issue seems to be that XLogWrite() does not wait for
the entire record to be written to disk before creating the ".ready"
file. Instead, it just waits for the last page of the segment to be
written before notifying the archiver. If PostgreSQL crashes before
it is able to write the rest of the record, it will end up reusing the
".ready" segment at the end of crash recovery. In the meantime, the
archiver process may have already processed the old version of the
segment.

This issue seems to most often manifest as WAL corruption on standby
servers after the primary server has crashed because it ran out of
disk space. I have attached a proof-of-concept patch
(ready_file_fix.patch) that waits to create any ".ready" files until
closer to the end of XLogWrite(). The patch is incorrect for a few
reasons, but I hope it helps illustrate the problem. I have also
attached another patch (repro_helper.patch) to be used in conjunction
with the following steps to reproduce the issue:

initdb .
pg_ctl -D . -o "-c archive_mode=on -c archive_command='exit 0'" -l log.log start
pgbench -i -s 1000 postgres
psql postgres -c "SELECT pg_switch_wal();"

With just repro_helper.patch applied, these commands should produce
both of the following log statements:

PANIC: failing at inconvenient time
LOG: status file already exists for "000000010000000000000017"

With both patches applied, the commands will only produce the first
PANIC statement.

Another thing I am exploring is whether a crash in between writing the
last page of a segment and creating the ".ready" file could cause the
archiver process to skip processing it altogether. In the scenario I
mention earlier, the server seems to recreate the ".ready" file since
it rewrites a portion of the segment. However, if a WAL record fits
perfectly into the last section of the segment, I am not sure whether
the ".ready" file would be created after restart.

I am admittedly in the early stages of working on this problem, but I
thought it would be worth reporting to the community early on in case
anyone has any thoughts on or past experiences with this issue.

Nathan

Attachment Content-Type Size
repro_helper.patch application/octet-stream 1.3 KB
ready_file_fix.patch application/octet-stream 1.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2019-12-12 23:19:01 Re: log bind parameter values on error
Previous Message Andres Freund 2019-12-12 22:32:52 Re: log bind parameter values on error