Quick Links

Re: [PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill

From:	Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
To:	Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: [PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill
Date:	2026-05-06 20:27:25
Message-ID:	CAFcNs+qNJqTErQDGVYQOa13eeGhku0AwQMs_prPJPg=qq6r9bg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, May 5, 2026 at 5:42 PM Zsolt Parragi <zsolt(dot)parragi(at)percona(dot)com>
wrote:
>
> Hello!
>
> I verified both the patch, and that the test case catches the bug
> without the patch, the overall change seems good to me.
>

Thanks for your review.

> + if (did_modify_vm)
> + INJECTION_POINT_CACHED("heap-force-kill-before-vm-wal", NULL);
> +
> if (did_modify_vm && RelationNeedsWAL(rel))
> + {
> log_newpage_buffer(vmbuf, false);
> + LockBuffer(vmbuf, BUFFER_LOCK_UNLOCK);
> + }
>
> Is the additional if intentional here? Based on the name it seems like
> it could be simply part of the next if, currently it also fires for
> unlogged tables.
>

My bad it is a leftover. Fixed.

> +# Give session 2 time to reach the VM buffer lock (or complete if
> +# unfixed). We cannot reliably detect blocking from Perl, so just
> +# sleep briefly.
> +use Time::HiRes qw(usleep);
> +usleep(500_000);
>
> Wouldn't this be possibly unstable on CI?
>

Yes it should be somewhat flaky. TBH, I'm not sure if the test will be
pushed with the fix, but anyway let's improve it.

> The following seems to work for me, both for detecting the issue in
> the unpatched version and to result in a quicker continuation in the
> patched version: (~1.55sec total execution time compared to ~2sec
> original)
>
> (with the patch, s2 wait for the loc, without the patch it finishes
> and becomes idle)
>
> my $s2 = $node->background_psql('postgres');
> my $s2_pid = $s2->query_safe(q{SELECT pg_backend_pid()});
> chomp $s2_pid;
>
> $s2->query_until(qr/starting_s2/,
> q(\echo starting_s2
> SELECT heap_force_kill('test_vm'::regclass, ARRAY['(1,1)']::tid[]);
> \echo s2_done
> ));
>
> use Time::HiRes qw(usleep);
> my $observed = '';
> my $deadline = time() + 10;
> while (time() < $deadline) {
> $observed = $node->safe_psql('postgres', qq{
> SELECT format('%s/%s/%s',
> coalesce(wait_event_type, 'NULL'),
> coalesce(wait_event, 'NULL'),
> coalesce(state, 'NULL'))
> FROM pg_stat_activity WHERE pid = $s2_pid
> });
> last if $observed =~ m{^Buffer/BufferExclusive/}
> || $observed =~ m{/idle$};
> usleep(50_000);
> }

Thanks, see the newly attached version (v2) with proposed changes.

Regards,

--
Fabrízio de Royes Mello
On behalf of PlanetScale

Attachment	Content-Type	Size
v2-0001-pg_surgery-Fix-WAL-corruption-from-concurrent-hea.patch	application/octet-stream	15.4 KB

In response to

Re: [PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill at 2026-05-05 20:42:43 from Zsolt Parragi

Responses

Re: [PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill at 2026-05-06 23:54:36 from Melanie Plageman

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Nathan Bossart	2026-05-06 21:31:46	Re: problems with toast.* reloptions
Previous Message	Jacob Champion	2026-05-06 18:56:35	Re: Why is_admin_of_role() use ROLERECURSE_MEMBERS rather than ROLERECURSE_PRIVS?