Re: archive status ".ready" files may be created too early

From: "Bossart, Nathan" <bossartn(at)amazon(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: archive status ".ready" files may be created too early
Date: 2019-12-13 21:24:36
Message-ID: 3AD716DF-53BA-45D6-A8C3-CA7ECD465DCF@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/12/19, 8:08 PM, "Kyotaro Horiguchi" <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn(at)amazon(dot)com> wrote in
>> Hi hackers,
>>
>> I believe I've uncovered a bug that may cause archive status ".ready"
>> files to be created too early, which in turn may cause an incorrect
>> version of the corresponding WAL segment to be archived.
>>
>> The crux of the issue seems to be that XLogWrite() does not wait for
>> the entire record to be written to disk before creating the ".ready"
>> file. Instead, it just waits for the last page of the segment to be
>> written before notifying the archiver. If PostgreSQL crashes before
>> it is able to write the rest of the record, it will end up reusing the
>> ".ready" segment at the end of crash recovery. In the meantime, the
>> archiver process may have already processed the old version of the
>> segment.
>
> Year, that can happen if the server restarted after the crash.
>
>> This issue seems to most often manifest as WAL corruption on standby
>> servers after the primary server has crashed because it ran out of
>> disk space.
>
> In the first place, it's quite bad to set restart_after_crash to on,
> or just restart crashed master in replication set. The standby can be
> incosistent at the time of master crash, so it should be fixed using
> pg_rewind or should be recreated from a base backup.
>
> Even without that archiving behavior, a standby may receive wal bytes
> inconsistent to the bytes from the same master just before crash. It
> is not limited to segment boundary. It can happen on every block
> boundary and could happen everywhere with more complecated steps.
>
> What you are calling as a "problem" seems coming from allowing the
> restart_after_crash behavior. On the other hand, as recommended in the
> documentation, archive_command can refuse overwriting of the same
> segment, but we don't impose to do that.
>
> As the result the patch doesn't seem to save anything than setting up
> and operating correctly.

Disregarding the behavior of standby servers for a minute, I think
that what I've described is still a problem for archiving. If the
segment is archived too early, point-in-time restores that require it
will fail. If the server refuses to overwrite existing archive files,
the archiver process may fail to process the "good" version of the
segment until someone takes action to fix it. I think this is
especially troubling for backup utilities like pgBackRest that check
the archive_status directory independently since it is difficult to
know if the segment is truly ".ready".

I've attached a slightly improved patch to show how this might be
fixed. I am curious what concerns there are about doing something
like it to prevent this scenario.

>> Another thing I am exploring is whether a crash in between writing the
>> last page of a segment and creating the ".ready" file could cause the
>> archiver process to skip processing it altogether. In the scenario I
>> mention earlier, the server seems to recreate the ".ready" file since
>> it rewrites a portion of the segment. However, if a WAL record fits
>> perfectly into the last section of the segment, I am not sure whether
>> the ".ready" file would be created after restart.
>
> Why that segment needs .ready after restart, even though nothing could
> be written to the old segment?

If a ".ready" file is never created for a segment, I don't think it
will be archived.

Nathan

Attachment Content-Type Size
fix_ready_file_v2.patch application/octet-stream 1.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2019-12-13 21:33:44 Re: archive status ".ready" files may be created too early
Previous Message Alvaro Herrera 2019-12-13 20:07:51 xlog.c variable pointlessly global