Re: Changing the state of data checksums in a running cluster

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Changing the state of data checksums in a running cluster
Date: 2025-08-27 09:39:35
Message-ID: 0e6ce93e-57e4-43ed-b410-66876c125ffb@vondra.me
Lists: pgsql-hackers

On 8/27/25 10:30, Daniel Gustafsson wrote:
>> On 26 Aug 2025, at 01:06, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>> I think this TAP looks very nice, but there's a couple issues with it.
>> See the attached patch fixing those.
>
> Thanks, I have incorporated (most of) your patch in the attached. I did keep
> the PG_TEST_EXTRA check for injection points though, which I assume was removed
> by mistake.
>

Yes, that was a mistake.

>> With these changes it runs for me, and I even saw some
>>
>> LOG: page verification failed
>>
>> in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
>> while - a couple minutes, maybe? I think I saw it at
>>
>> t/006_concurrent_pgbench.pl .. 427/?
>
> That's very interesting, I have been running it to timeout several times in a
> row without hitting any verification failures. Will keep running.
>

Just to be clear - I don't see any pg_checksums failures either. I only
see failures in the standby log, and I don't think the script checks
that (it probably should).
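
As a rough illustration of what such a check could look like, here is a
Python sketch (not the Perl TAP code from the patch; the helper name and the
way the log path is passed in are assumptions made up for this example). It
simply scans a log file for the message quoted above:

```python
from pathlib import Path

# Message text as it appears in the standby log quoted in this thread.
FAILURE_MARKER = "page verification failed"


def find_verification_failures(log_path):
    """Return the log lines that contain the checksum failure message.

    Hypothetical helper: the name and signature are illustrative only.
    """
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if FAILURE_MARKER in line]
```

A test could call this on the standby log (for example
tmp_check/log/006_concurrent_pgbench_standby_1.log) and fail if the
returned list is non-empty.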

>> or something like that. I think the bash version did a couple things
>> differently, which might make the failures more frequent (but it's just
>> a wild guess).
>>
>> In particular, I think the script restarts the two nodes independently,
>> while the TAP always stops both primary and standby, in this order. I
>> think it'd be useful to restart one or both.
>
> Done in the attached; it will now randomly stop one or both or none. If a
> node is stopped, I've added an offline pg_checksums step to validate the
> data files as a why-not test.
>
>> The other thing is the bash script added some random delays/sleep, which
>> increases the test duration, but it also means generating somewhat
>> random amounts of data, etc. It also randomized some other stuff (scale,
>> client count, ...). But that can wait.
>
> Added as well in a few places, maybe more can be sprinkled in.
>

Thanks. I'll take a look.
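
For reference, the "stop one or both or none" selection described above
could be sketched roughly like this in Python. This is only an illustration:
the real test is a Perl TAP script, the function name is hypothetical, and
choosing each node independently with probability 1/2 is an assumption, not
necessarily what the patch does.

```python
import random

NODES = ("primary", "standby")


def pick_nodes_to_stop(rng):
    """Pick none, one, or both nodes.

    Assumption: each node is chosen independently with probability 1/2.
    """
    return [node for node in NODES if rng.random() < 0.5]


rng = random.Random()
for iteration in range(3):
    stopped = pick_nodes_to_stop(rng)
    # In the test described above, a stopped node would get an offline
    # pg_checksums run before being started again.
    print(iteration, stopped or "none")
```

Passing in the random generator makes it possible to seed it and reproduce a
particular stop/restart sequence when chasing a failure.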

regards

--
Tomas Vondra
