From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Daniel Gustafsson <daniel(at)yesql(dot)se> |
Cc: | Bernd Helmle <mailings(at)oopsware(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Michael Banck <mbanck(at)gmx(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Changing the state of data checksums in a running cluster |
Date: | 2025-08-25 23:06:24 |
Message-ID: | 47d946a3-c1c7-421d-a2b1-6a51cc329e6c@vondra.me |
Lists: | pgsql-hackers |
On 8/25/25 20:32, Daniel Gustafsson wrote:
>> On 20 Aug 2025, at 16:37, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>> This happens quite regularly, it's not hard to hit. But I've only seen
>> it to happen on a FSM, and only right after immediate shutdown. I don't
>> think that's quite expected.
>>
>> I believe the built-in TAP tests (with injection points) can't catch
>> this, because there's no concurrent activity while flipping checksums
>> on/off. It'd be good to do something like that, by running pgbench in
>> the background, or something like that.
>
> In searching for this bug I opted for implementing a version of the stress
> tests as a TAP test, see 006_concurrent_pgbench.pl in the attached patch
> version. It's gated behind PG_TEST_EXTRA since it's clearly not something
> which can be enabled by default (if this goes in this need to be re-done to
> provide two levels IMO, but during testing this is more convenient). I'm
> curious to see which improvements you can think to make it stress the code to
> the breaking point.
>
I think this TAP test looks very nice, but there are a couple of issues
with it. See the attached patch fixing those.
1) I think test_checksums should be listed in src/test/modules/Makefile?
2) The test_checksums/Makefile didn't work for me, I was getting
Makefile:23: *** recipe commences before first target. Stop.
because a "\" line continuation was missing, so I had to fix that. Then
it was complaining about Makefile.global or something, so I fixed that by
cargo-culting what other Makefiles in test modules do. Now it seems to
work for me. I guess you're on meson?
3) I'm no Perl expert, but AFAICS the test wasn't really running
pgbench, for a couple of reasons. It was passing "-q" to pgbench, but
that's only for initialization. The clusters had max_connections=10, but
pgbench was using "-c 10", so I was getting "too many connections".
It was not setting "$pgbench_running = 1", so the other loops were
getting "too many connections" too. Another thing is I'm not sure it's
OK to pass '' to IPC::Run::start; I think it'll take it as an argument,
confusing pgbench.
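To illustrate the argument handling described in point 3, here is a rough bash sketch. It is not the code from the attached patch or from the TAP test. The helper name build_pgbench_args and the headroom of 3 connections are assumptions made for the example; only the pgbench options -n, -c and -T are real. It keeps the client count below max_connections and never emits an empty-string argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the pgbench command line, one argument per line.
#   $1 = max_connections of the target cluster
#   $2 = run duration in seconds
#   $3 = optional extra argument (may be empty or omitted)
build_pgbench_args() {
    local max_connections=$1 duration=$2 extra=${3:-}
    local reserved=3   # assumed headroom for the test's own sessions
    local clients=$(( max_connections - reserved ))
    if (( clients < 1 )); then
        clients=1
    fi

    # No -q here: that option only affects initialization (-i) mode.
    local args=(pgbench -n -c "$clients" -T "$duration")

    # Append the optional argument only when it is non-empty, so the
    # command never receives a literal '' argument.
    if [[ -n $extra ]]; then
        args+=("$extra")
    fi

    printf '%s\n' "${args[@]}"
}
```

With max_connections=10 this yields "-c 7" instead of "-c 10", which leaves room for the sessions that flip checksums and poll the cluster state.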
With these changes it runs for me, and I even saw some
LOG: page verification failed
in tmp_check/log/006_concurrent_pgbench_standby_1.log. But it takes a
while - a couple minutes, maybe? I think I saw it at
t/006_concurrent_pgbench.pl .. 427/?
or something like that. I think the bash version did a couple things
differently, which might make the failures more frequent (but it's just
a wild guess).
In particular, I think the script restarts the two nodes independently,
while the TAP test always stops both the primary and the standby, in that
order. I think it'd be useful to restart just one of them, or both.
The other thing is the bash script added some random delays/sleep, which
increases the test duration, but it also means generating somewhat
random amounts of data, etc. It also randomized some other stuff (scale,
client count, ...). But that can wait.
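The two differences mentioned above (restarting nodes independently, and random delays) could look roughly like this in bash. This is only a sketch under assumptions, not the actual stress script: the helper names, the node labels and the delay bounds are all made up for illustration, and the loop only prints what it would do instead of calling pg_ctl.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick which node(s) to restart in this round.
pick_restart_targets() {
    case $(( RANDOM % 3 )) in
        0) echo "primary" ;;
        1) echo "standby" ;;
        2) echo "primary standby" ;;
    esac
}

# Hypothetical helper: random delay in seconds, in the range [min, max].
random_delay() {
    local min=$1 max=$2
    echo $(( min + RANDOM % (max - min + 1) ))
}

# Dry-run round: print the chosen delay and nodes instead of restarting them.
stress_round() {
    local delay node
    delay=$(random_delay 1 10)
    echo "sleep ${delay}s"
    for node in $(pick_restart_targets); do
        echo "restart ${node}"
    done
}
```

A real script would replace the echo lines with an actual sleep and a restart of the chosen node.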
regards
--
Tomas Vondra
Attachment | Content-Type | Size |
---|---|---|
checksums-fixes.patch | text/x-patch | 4.2 KB |