Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory
Date: 2026-05-04 14:21:55
Message-ID: aeae820d-9a58-462e-b3aa-94526a031389@vondra.me
Views: Whole Thread | Raw Message | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/4/26 15:56, Andres Freund wrote:
> Hi,
>
> On 2026-05-04 10:18:31 +0200, Tomas Vondra wrote:
>> A couple of days ago we got a report regarding an incorrect shmem size
>> calculation in btree [1], leading to a buffer overflow / memory
>> corruption. Which became a much less common issue in our code, thanks to
>> valgrind and similar tools. But it took me a while to realize valgrind
>> won't catch this because we only instrument private memory (palloc et
>> al), while shmem is left alone.
>>
>> I was wondering if it's feasible to improve this. Attached is a trivial
>> patch that adjusts shm_toc.c to add a couple NOACCESS bytes after each
>> entry in the segment. It seems to do the trick - with this we get a
>> reasonable report (for the reproducer provided in the bug report, before
>> it got fixed by 748d871b7c) from valgrind, with invalid accesses. See
>> the attached .log for an example. It's much better than the confusing
>> crashes due to corrupted state.
>>
>> There's an issue, though. It seems the valgrind memory markings are not
>> shared between processes. The leader sets the shm_toc up, marks the
>> ranges as NOACCESS, and then checks it while accessing the memory. But
>> the parallel workers don't seem to see this, and so will produce no
>> reports. I'm assuming this is the case, because all the reports come
>> from the leader, never from the workers. Maybe there's a different
>> explanation (e.g. maybe it's just the leader touching the memory?).
>
> I assume the issue is just that the workers don't have the NOACCESS markers? I
> think you'd need to do them in every process using the shm_toc. Either by
> doing it in shm_toc_attach() or in shm_toc().
>

Right, but it'd also need to know how long the entry is. Which AFAIK it
does not. We don't want to redo the calculation in every worker, but we
might add a "len" field to the toc entry.

>
>
>> An alternative would be to do mprotect(), but unfortunately it requires
>> page-aligned ranges, which makes it somewhat useless for small overflows
>> of a couple bytes (like here). It should work cross-process, I think,
>
> It doesn't work across processes. Every process has their own mprotect "view".
>

Ah, OK. So I guessed wrong, and it has the same issue as NOACCESS.

>
>> FWIW the PoC patch adds a 32-byte chunk, not just a single byte. This is
>> intentional, because if the state is an array, it's quite possible the
>> invalid access steps over a fair number of bytes. I'm actually thinking
>> it should be even larger.
>>
>> This modifies just the dynamic shmem, but maybe we should do this even
>> for the "regular" shmem allocated at start. Issues in that would likely
>> cause crashes pretty quick (unlike the btree issue, which requires a
>> somewhat special reproducer), but a nice valgrind report helps.
>
> I can tell you from experience that no, it's not necessarily quickly caught.
> So yes, I think we should definitely do that.
>

Understood.

>
>
>> /*
>> * Allocate shared memory from a segment managed by a table of contents.
>> *
>> @@ -119,9 +127,19 @@ shm_toc_allocate(shm_toc *toc, Size nbytes)
>> }
>> vtoc->toc_allocated_bytes += nbytes;
>>
>> +#ifdef USE_VALGRIND
>> + vtoc->toc_allocated_bytes += NUM_NOACCESS_BYTES;
>> +#endif
>> +
>> SpinLockRelease(&toc->toc_mutex);
>>
>> - return ((char *) toc) + (total_bytes - allocated_bytes - nbytes);
>> +#ifdef USE_VALGRIND
>> + /* make the bytes at the end no-access */
>> + VALGRIND_MAKE_MEM_NOACCESS(((char *) toc) + (total_bytes - allocated_bytes - NUM_NOACCESS_BYTES),
>> + NUM_NOACCESS_BYTES);
>> +#endif
>> +
>> + return ((char *) toc) + (total_bytes - allocated_bytes - nbytes - NUM_NOACCESS_BYTES);
>> }
>
> The size is already rounded up by that point:
>
> /*
> * Make sure request is well-aligned. XXX: MAXALIGN is not enough,
> * because atomic ops might need a wider alignment. We don't have a
> * proper definition for the minimum to make atomic ops safe, but
> * BUFFERALIGN ought to be enough.
> */
> nbytes = BUFFERALIGN(nbytes);
>
> Which means that you'll often have a up to 32byte pad at the end of the
> (ALIGNOF_BUFFER=32) allocation already. I don't care about the waste, but the
> ALIGNOF_BUFFER padding will often prevent detecting smaller out-of-bounds
> accesses.
>

Good point. I forgot about this alignment rounding. I don't care about
the waste either (it'd be a bit silly in a valgrind build anyway), but
not catching smaller mistakes would not be great.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message SATYANARAYANA NARLAPURAM 2026-05-04 14:26:34 Re: [Bug]Vacuum full silently NULL out fast default columns
Previous Message Andrew Dunstan 2026-05-04 14:19:21 Re: [PATCH] Reject ENCODING option for COPY TO FORMAT JSON