| From: | Palak Chaturvedi <chaturvedipalak1911(at)gmail(dot)com> |
|---|---|
| To: | Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-hackers(at)postgresql(dot)org |
| Subject: | Re: Changing the state of data checksums in a running cluster |
| Date: | 2026-07-02 05:06:21 |
| Message-ID: | CALfch1-earbZTedkbb_3=PgywzBVAc1FXZfz834bb0DKZGDEXw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi Tomas,
Thanks for these test patches. They are quite useful. I've been going
through them because we think they'll also help with the TAP tests for the
online shared_buffers resize work , where I'm ending up reusing several of
the same helpers (attach_injection_points, background_rw_pgbench, and a
couple of others).
Would you be open to lifting these helpers out into a shared Perl module
(for example under src/test/perl/PostgreSQL/Test/…) so both test suites can
consume them instead of each copy-pasting? I'm happy to prepare a small
prep patch on top of your latest version if that sounds reasonable to you.
Thanks,
Palak
On Mon, 4 May 2026 at 18:47, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Hi,
>
> Thanks for getting this feature pushed, and for resolving the failures
> reported since the feature freeze. I consider this to be an important
> improvement, not just for the feature itself, but also because of all
> the useful infrastructure it added.
>
> Attached is a refined version of the TAP tests already posted by Daniel
> some time ago [1]. Unfortunately, that .txt did not apply cleanly for
> some reason, so here's a better version.
>
> I found these tests quite useful when reasoning about how the patch
> behaves in concurrent environment (e.g. with multiple sessions
> triggering checksum enable/disable, or with a checkpoint, crashes, etc).
>
> At this point all the tests pass, but there are a couple cases with
> correct but slightly surprising behavior, worth discussing. Which is
> what this e-mail is going to be about.
>
> I'll explain what the TAP tests aim to do first, and then discuss the
> slightly surprising behavior.
>
> It's not meant for inclusion into PG19, at least not in this shape - I
> wrote those TAP tests while investigating some of the earlier failures
> and/or when wondering about behavior in various situations (sequence of
> concurrent steps, race conditions, ...). So it's more of an exhaustive,
> and the tests are somewhat redundant (N+1 is often just (N + some small
> tweak)).
>
> I can imagine distilling it into a tiny subset, and adding that. But
> that's up to discussion. But that's for later.
>
>
> Let me briefly explain what the various TAP tests aim to do. From the
> very beginning, my main concern regarding this patch was race conditions
> when updating the shared state about effective data_checksum_version.
> Because the state is effectively split into about three or four places:
>
> * LocalDataChecksumVersion (local cache)
> * XLogCtl->data_checksum_version (XLogCtl->info_lck)
> * ControlFile->data_checksum_version (ControlFileLock)
> * state in control file on disk
>
> These pieces are protected by different locks, the protocol for updating
> and/or reading the various flags is not trivial (and some of the fixed
> issues were due to ControlFile->data_checksum_version being updated from
> a place that shouldn't have).
>
> So the primary goal of the TAP tests was to check for race conditions by
> leveraging injection points to step through concurrent processes in a
> deterministic way. The first couple patches (0001-0004) add debug
> logging and injection points into a lot of places. And by "a lot" I mean
> ~80 new injection points, which is about the number of injection points
> we have in master now. Anyway, this allows stepping through concurrent
> checksum changes, and also checksum change vs. checkpointer.
>
> Then come the actual TAP tests:
>
> 1) 0005-TAP-10-concurrent-checksum-changes.patch
>
> Two concurrent checksum changes. The first one gets paused at an
> injection point, then the second one gets initiated.
>
> 2) 0006-TAP-11-concurrency-with-checkpoints.patch
>
> A checksum change + checkpoint. The change gets paused at an injection
> point, a synchronous checkpoint is performed.
>
> 3) 0007-TAP-12-crashes-at-injection-points.patch
>
> Similar to 0006, but with a crash + recovery. A checksum change gets
> paused at an injection point, a synchronous checkpoint is performed. The
> changes gets wpken up and either completes, or pauses on a different
> injection point. A restart/crash happens.
>
> 4) 0008-TAP-13-concurrency-with-checkpoint-REDO.patch
>
> Similar to 0007, but the checkpoint is not synchronous - happens in the
> background, so that the TAP can step through both sides and interleave
> them in an arbitrary way. This matters because the checksum change
> updates the different state pieces (XlogCtl/ControlFile), while the
> checkpointer reads them to record initial state for REDO etc.
>
> 5) 0009-TAP-14-checkpoints-with-crashes.patch
>
> Similar to 0008, except that the steps are more fine grained, and
> focused on two particular cases with surprisingly different final state.
>
>
> AFAIK everything works as expected, except for two cases in the "TAP
> 012" test. One for the "enabling" direction, one for the "disabling"
> direction. I'm going to discuss the "enabling" direction, I believe the
> other case is just a mirror with the same root cause.
>
> The TAP 012 tests checksum change with a concurrent checkpoint, followed
> by a crash, and tests the final state. It pauses the change at an
> injection point, does a checkpoint, proceeds to the next injection
> point, crashes and does recovery. The expectation is that the final
> state "flips" at some injection point, once it gets further enough, and
> stays there. But what actually happens is this:
>
> a) test_checksum_transition(
> 'disabled', 'enable', undef,
> 'datachecksums-enable-inprogress-checksums-end',
> 'datachecksums-enable-checksums-start',
> 'off');
>
> b) test_checksum_transition(
> 'disabled', 'enable', undef,
> 'datachecksums-enable-checksums-start',
> 'datachecksums-enable-checksums-after-xlog',
> 'on');
>
> c) test_checksum_transition(
> 'disabled', 'enable', 'datachecksums-enable-checksums-start',
> 'datachecksums-enable-checksums-after-xlogs',
> 'datachecksums-enable-checksums-after-xlogctl',
> 'off');
>
> This says that if the checkpoint happens after
> 'datachecksums-enable-inprogress-checksums-end' or after
> 'datachecksums-enable-checksums-after-xlog', we end up with 'off' (i.e.
> enabling checksums fails).
>
> But if the checkpoint happens after
> 'datachecksums-enable-checksums-start', we end up with "on" (after
> recovery).
>
> This is a bit surprising, because that injection point is before
> 'datachecksums-enable-checksums-after-xlog'. So the enabling process
> gets further and further, but the final state flips off -> on -> off,
> contradicting the expectation that it changes once.
>
> I haven't quite wrapped my head around it yet, but my understanding is
> this is due to a race condition between the checksum launcher (writing
> XLOG2_CHECKSUMS and updating the shmem state), and the checkpointer
> (reading the shmem state and generating REDO).
>
> The launcher does this sequence of steps:
>
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> while the checkpointer (CreateCheckPoint) does this:
>
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
>
> The outcome depends on how exactly these two sequences interleave. For
> example, this can happen:
>
> 1) write XLOG2_CHECKSUMS with new state
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> Which means the XLOG_CHECKPOINT_REDO will be after XLOG2_CHECKSUMS (and
> so redo won't see it), but the checkpoint will still get the old
> checksum state from XLogCtl. And so the outcome is "off", per case (c).
>
> But it can also happen what case (b) does:
>
> A) read XLogCtl->data_checksum_version (while holding insert locks)
> B) insert XLOG_CHECKPOINT_REDO (reads XLogCtl->data_checksum_version)
> C) UpdateControlFile()
> 1) write XLOG2_CHECKSUMS with new state
> 2) update XLogCtl->data_checksum_version
> 3) update ControlFile->data_checksum_version
> 4) UpdateControlFile()
> 5) emits barrier
>
> In which case the REDO will have the old state, but the recovery will
> read the XLOG2_CHECKSUMS, and so end up with "on".
>
> This is the root cause of the surprising behavior in TAP 012, I think.
>
> I attempted to trigger these race conditions in TAP 013, but without
> much success. In the end I realized it probably needs more control,
> waiting for the other process to hit the next injection point before
> unpausing the current one. TAP 014 does that, and it shows that with the
> right interleaving of steps the (c) case can end up with both "on" and
> "off" final state.
>
> As I said, I don't claim I fully understand this yet. But I wouldn't
> call this "bug" - AFAICS it won't produce an incorrect final state (I
> haven't seen any such cases).
>
> Still, I wonder if there's a potential issue I failed to notice.
>
>
> The other question I had when looking at this (concurrency with
> checkpoints) is what we get by doing
>
> MyProc->delayChkptFlags |= DELAY_CHKPT_START;
>
> whenever updating the state in SetDataChecksums... functions. Because
> the only thing that guarantees is the updates happen on one side of the
> checkpoint record. What does that give us, actually?
>
> It does not seem to prevent this surprising behavior, and it does not
> say the XLOG2_CHECKSUMS happens before/after the XLOG_CHECKPOINT_REDO.
>
>
> regards
>
> [1]
>
> https://www.postgresql.org/message-id/9197F930-DDEB-4CAC-82A2-16FEC715CCE8%40yesql.se
>
> --
> Tomas Vondra
>
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Ewan Young | 2026-07-02 05:27:00 | Re: Fix jsonpath .decimal() to honor silent mode |
| Previous Message | Bertrand Drouvot | 2026-07-02 05:02:36 | Re: Prevent crash when calling pgstat functions with unregistered stats kind |