Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit

From: Baji Shaik <baji(dot)pgdev(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)kurilemu(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
Date: 2026-05-18 00:34:20
Message-ID: CA+fm-RN2oL=mWXA6NLKNS5wrCXLML-z+O6euKW1ga2ZEYFVQ0Q@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Alvaro, Sami,

Thank you both for the feedback.

Sami, thanks for identifying the use-after-free with on_proc_exit
and providing the v2 patch with before_shmem_exit. That helped
narrow down the right approach.

On Sun, May 17, 2026 at 3:46 PM Alvaro Herrera <alvherre(at)kurilemu(dot)de> wrote:

> I think a better fix for this is to use PG_ENSURE_ERROR_CLEANUP(). That
> way we avoid leaving callbacks in place, which would not be great if the
> same backend does a lot of REPACKs: after a dozen or so, it dies with
>
> FATAL: out of before_shmem_exit slots
>
>
Alvaro, you're right that PG_ENSURE_ERROR_CLEANUP is the better
approach here. Attached is v3 which uses it instead of
before_shmem_exit or on_proc_exit.

Summary of the issues with v1 and v2:

- v1 (on_proc_exit): Crashes with assertions enabled because
memory contexts are already destroyed by the time on_proc_exit
callbacks run. The worker handle is clobbered with 0x7f
(CLOBBER_FREED_MEMORY).

- v2 (before_shmem_exit): Works correctly for a single REPACK,
but leaks a callback slot on each successful REPACK
(CONCURRENTLY). After ~15 REPACKs in the same session:
FATAL: out of before_shmem_exit slots.

v3 uses PG_ENSURE_ERROR_CLEANUP which:
- Handles both ERROR and FATAL exits
- Automatically cancels the callback on normal completion
(no slot leak)
- Runs before memory contexts are destroyed (no use-after-free)

The existing PG_TRY/PG_FINALLY block is replaced with
PG_ENSURE_ERROR_CLEANUP for the concurrent path, and a plain
rebuild_relation() call for the non-concurrent path.

Tested with --enable-cassert --enable-debug --enable-injection-points:
- All 245 regression tests pass (including cluster)
- All 8 injection point tests pass (including repack and repack_toast)
- pg_terminate_backend cleans up worker and slot without crash
- 20 REPACK (CONCURRENTLY) in same session completes without
slot exhaustion

I have not added a dedicated regression test for the
pg_terminate_backend scenario yet, but I can write one using
injection points if needed.

Thanks,
Baji Shaik

Attachment Content-Type Size
v3-0001-Fix-REPACK-decoding-worker-not-cleaned-up-on-FATAL-exit.patch application/octet-stream 3.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Junwang Zhao 2026-05-18 02:07:02 Re: (SQL/PGQ) cache lookup failed for label
Previous Message Ashutosh Bapat 2026-05-18 00:22:19 Re: (SQL/PGQ) cache lookup failed for label