walwriter can set XLP_BKP_REMOVABLE wrongly: race w/ backup start

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: walwriter can set XLP_BKP_REMOVABLE wrongly: race w/ backup start
Date: 2025-07-05 00:16:28
Message-ID: 20250705001628.c3.nmisch@google.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

During a post-commit review of bc22dc0 "Get rid of WALBufMappingLock", I got
suspicious of this older AdvanceXLInsertBuffer() code:

/*
* If online backup is not in progress, mark the header to indicate
* that WAL records beginning in this page have removable backup
* blocks. This allows the WAL archiver to know whether it is safe to
* compress archived WAL data by transforming full-block records into
* the non-full-block format. It is sufficient to record this at the
* page level because we force a page switch (in fact a segment
* switch) when starting a backup, so the flag will be off before any
* records can be written during the backup. At the end of a backup,
* the last page will be marked as all unsafe when perhaps only part
* is unsafe, but at worst the archiver would miss the opportunity to
* compress a few records.
*/
if (Insert->runningBackups == 0)
NewPage->xlp_info |= XLP_BKP_REMOVABLE;

No lock stops runningBackups from changing while walwriter runs that code.
The attached test case reproduces setting XLP_BKP_REMOVABLE on a page that
should not have it:

2025-07-04 16:35:17.322 PDT [1496038] PANIC: initialization of 0/1000000 not reflecting backup started at 0/F00028

The test patch is for v17, but a similar test reaches the same race condition
in master (equivalent to v18 for $SUBJECT purposes, I bet).

Nothing in core reacts to XLP_BKP_REMOVABLE, and I'm not aware of non-core
code that reacts to it and is still maintained.
https://codesearch.debian.net/search?q=XLP_BKP_REMOVABLE&literal=1&perpkg=1
finds none, and 2016-2018 discussion
postgr.es/m/flat/579297F8.7020107%40anastigmatix.net found none. I wonder if
ceasing to set XLP_BKP_REMOVABLE for v19 is the best fix.

I don't plan to address this myself, so I'm leaving this here.

Thanks,
nm

Attachment Content-Type Size
repro-XLP_BKP_REMOVABLE-race-v0.patch17 text/plain 4.5 KB

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-07-05 01:11:32 Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
Previous Message Alexander Lakhin 2025-07-04 20:00:00 Re: stats.sql might fail due to shared buffers also used by parallel tests