Quick Links

Re: Why our Valgrind reports suck

From:	Yasir <yasir(dot)hussain(dot)shah(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Why our Valgrind reports suck
Date:	2025-05-09 10:40:08
Message-ID:	CAA9OW9e0OU83Gu-wCLggi8JUZx-98foei438Jysh3pcsCgbC8g@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, May 9, 2025 at 7:04 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> A nearby thread [1] reminded me to wonder why we seem to have
> so many false-positive leaks reported by Valgrind these days.
> For example, at exit of a backend that's executed a couple of
> trivial queries, I see
>
> ==00:00:00:25.515 260013== LEAK SUMMARY:
> ==00:00:00:25.515 260013== definitely lost: 3,038 bytes in 90 blocks
> ==00:00:00:25.515 260013== indirectly lost: 4,431 bytes in 61 blocks
> ==00:00:00:25.515 260013== possibly lost: 390,242 bytes in 852 blocks
> ==00:00:00:25.515 260013== still reachable: 579,139 bytes in 1,457
> blocks
> ==00:00:00:25.515 260013== suppressed: 0 bytes in 0 blocks
>
> so about a thousand "leaked" blocks, all but a couple of which
> are false positives --- including nearly all the "definitely"
> leaked ones.
>
> Some testing and reading of the Valgrind manual [2] turned up a
> number of answers, which mostly boil down to us using very
> Valgrind-unfriendly data structures. Per [2],
>
> There are two ways a block can be reached. The first is with a
> "start-pointer", i.e. a pointer to the start of the block. The
> second is with an "interior-pointer", i.e. a pointer to the middle
> of the block.
>
> [ A block is reported as "possibly lost" if ] a chain of one or
> more pointers to the block has been found, but at least one of the
> pointers is an interior-pointer.
>
> We have a number of places that allocate space in such a way that
> blocks can only be reached by "interior pointers", leading to
> those blocks being reported as possibly lost:
>
> * MemoryContextAllocAligned does this more or less by definition.
>
> * Dynahash tables often end up looking like this, since the first
> element in each block created by element_alloc will be the tail end of
> its freelist, and thus will be reachable only via interior pointers
> later in the block, except when it's currently allocated.
>
> * Blocks that are reached via slists or dlists are like this
> unless the slist_node or dlist_node is at the front, which is not
> our typical practice.
>
> (There may be more cases, but those are what I identified in the
> leak report quoted above.)
>
> Another odd thing, which I have not found the explanation for, is
> that strings made by MemoryContextCopyAndSetIdentifier() show up
> as "definitely lost". This is nonsense, because they are surely
> referenced by the context's "ident" field; but apparently Valgrind
> isn't counting that for some reason.
>
> I'd be okay with using a suppression pattern to hide the
> MemoryContextCopyAndSetIdentifier cases, but that doesn't seem
> like a very palatable answer for these other cases: too much
> risk of hiding actual leaks.
>
> I don't see a way to avoid the problem for MemoryContextAllocAligned
> with the current context-type-agnostic implementation of that. We
> could probably fix it though if we pushed it down a layer, so that the
> alignment padding space could be treated as part of the chunk header.
> Might be able to waste less space that way, too.
>
> Dynahash tables could be fixed by expending a little extra storage
> to chain all the element-pool blocks together explicitly, which
> seems probably acceptable to do in USE_VALGRIND builds. (Maybe
> while we are at that, we could fix things so that currently-unused
> element slots are marked NOACCESS.)
>
> I don't have an answer for slist/dlist usages other than rearranging
> all the related structs. Anybody see a better way?
>
> Or we could do nothing, but I think there is value in having
> less clutter in Valgrind reports. Thoughts?
>
> regards, tom lane
>
> [1]
> https://www.postgresql.org/message-id/flat/CAA9OW9e4RbpgQd8NSzpW6BgJQNpKGEFoohWhkbEL8%3DZ5obvD0Q%40mail.gmail.com
> [2] https://valgrind.org/docs/manual/mc-manual.html#mc-manual.leaks
>
>
>
Thanks for the detailed analysis, this was very informative. I agree that
reducing noise in Valgrind reports would be valuable, especially for
catching real leaks. Having a clearer signal from Valgrind would definitely
help long term.

Regards,

Yasir
Data Bene

In response to

Why our Valgrind reports suck at 2025-05-09 02:04:06 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stepan Neretin	2025-05-09 10:57:35	Re: [PATCH] avoid double scanning in function byteain
Previous Message	Stepan Neretin	2025-05-09 10:37:39	Re: [PATCH] avoid double scanning in function byteain