Re: Getting better results from valgrind leak tracking

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Getting better results from valgrind leak tracking
Date: 2021-03-18 21:51:28
Message-ID: 3816764.1616104288@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> There's plenty other hits, but I think I should get back to working on
>> making the shared memory stats patch committable. I really wouldn't want
>> it to slip yet another year.

> +1, so far there's little indication that we're finding any serious leaks
> here. Might be best to set it all aside till there's more time.

Well, I failed to resist the temptation to keep poking at this issue,
mainly because I felt that it'd be a good idea to make sure we'd gotten
our arms around all of the detectable issues, even if we choose not to
fix them right away. The attached patch, combined with your preceding
memory context patch, eliminates all but a very small number of the leak
complaints in the core regression tests.

The remaing leak warnings that I see are:

1. WaitForReplicationWorkerAttach leaks the BackgroundWorkerHandle it's
passed. I'm not sure which function should clean that up, but in any
case it's only 16 bytes one time in one process, so it's hardly exciting.

2. There's lots of leakage in text search dictionary initialization
functions. This is hard to get around completely, because the API for
those functions has them being called in the dictionary's long-lived
context. In any case, the leaks are not large (modulo comments below),
and they would get cleaned up if the dictionary cache were thrown away.

3. partition_prune.sql repeatably produces this warning:

==00:00:44:39.764 3813570== 32,768 bytes in 1 blocks are possibly lost in loss record 2,084 of 2,096
==00:00:44:39.764 3813570== at 0x4C30F0B: malloc (vg_replace_malloc.c:307)
==00:00:44:39.764 3813570== by 0x94315A: AllocSetAlloc (aset.c:941)
==00:00:44:39.764 3813570== by 0x94B52F: MemoryContextAlloc (mcxt.c:804)
==00:00:44:39.764 3813570== by 0x8023DA: LockAcquireExtended (lock.c:845)
==00:00:44:39.764 3813570== by 0x7FFC7E: LockRelationOid (lmgr.c:116)
==00:00:44:39.764 3813570== by 0x5CB89F: findDependentObjects (dependency.c:962)
==00:00:44:39.764 3813570== by 0x5CBA66: findDependentObjects (dependency.c:1060)
==00:00:44:39.764 3813570== by 0x5CBA66: findDependentObjects (dependency.c:1060)
==00:00:44:39.764 3813570== by 0x5CCB72: performMultipleDeletions (dependency.c:409)
==00:00:44:39.764 3813570== by 0x66F574: RemoveRelations (tablecmds.c:1395)
==00:00:44:39.764 3813570== by 0x81C986: ProcessUtilitySlow.isra.8 (utility.c:1698)
==00:00:44:39.764 3813570== by 0x81BCF2: standard_ProcessUtility (utility.c:1034)

which evidently is warning that some LockMethodLocalHash entry is losing
track of its lockOwners array. But I sure don't see how that could
happen, nor why it'd only happen in this test. Could this be a false
report?

As you can see from the patch's additions to src/tools/valgrind.supp,
I punted on the issues around pl/pgsql's function-compile-time leaks.
We know that's an issue, but again the leaks are fairly small and
they are confined within function cache entries, so I'm not sure
how hard we should work on that.

(An idea for silencing both this and the dictionary leak warnings is
to install an on-proc-exit callback to drop the cached objects'
contexts.)

Anyway, looking through the patch, I found several non-cosmetic issues:

* You were right to suspect that there was residual leakage in guc.c's
handling of error cases. ProcessGUCArray and call_string_check_hook
are both guilty of leaking malloc'd strings if an error in a proposed
GUC setting is detected.

* I still haven't tried to wrap my brain around the question of what
RestoreGUCState really needs to be doing. I have, however, found that
check-world passes just fine with the InitializeOneGUCOption calls
diked out entirely, so that's what's in this patch.

* libpqwalreceiver.c leaks a malloc'd error string when
libpqrcv_check_conninfo detects an erroneous conninfo string.

* spell.c leaks a compiled regular expression if an ispell dictionary
cache entry is dropped. I fixed this by adding a memory context reset
callback to release the regex. This is potentially a rather large
leak, if the regex is complex (though typically it wouldn't be).

* BuildEventTriggerCache leaks working storage into
EventTriggerCacheContext.

* Likewise, load_domaintype_info leaks working storage into
a long-lived cache context.

* RelationDestroyRelation leaks rd_statlist; the patch that added
that failed to emulate the rd_indexlist prototype very well.

* As previously noted, RelationInitTableAccessMethod causes leaks.

* I made some effort to remove easily-removable leakage in
lookup_ts_dictionary_cache itself, although given the leaks in
its callees, this isn't terribly exciting.

I am suspicious that there's something still not quite right in the
memory context changes, because I noticed that the sanity_check.sql
test (and no other ones) complained about row_description_context being
leaked. I realized that the warning was because my compiler optimizes
away the row_description_context static variable altogether, leaving no
pointer to the context behind. I hacked around that in this patch by
marking that variable volatile. However, that explanation just begs
the question of why every run didn't show the same warning. I suppose
that valgrind was considering the context to be referenced by some
child or sibling context pointer, but if that's the explanation then
we should never have seen the warning. I'm forced to the conclusion
that valgrind is counting some but not all child/sibling context
pointers, which sure seems like a bug. Maybe we need the two-level-
mempool mechanism after all to get that to work better.

Anyway, I think we need to commit and even back-patch the portion
of the attached changes that are in
* libpqwalreceiver.c
* spell.h / spell.c
* relcache.c
* guc.c (except the quick hack in RestoreGUCState)
Those are all genuine leaks that are in places where they could be
repeated and thus potentially add up to something significant.

There's a weaker case for applying the changes in evtcache.c,
ts_cache.c, and typcache.c. Those are all basically leaving some cruft
behind in a long-lived cache entry. But the cruft isn't huge and it
would be recovered at cache flush, so I'm not convinced it amounts to a
real-world problem.

The rest of this is either working around valgrind shortcomings or
jumping through a hoop to convince it that some data structure that's
still around at exit is still referenced. Maybe we should commit some
of it under "#ifdef USE_VALGRIND" tests. I really want to find some
other answer than moving the dlist_node fields, though.

Comments?

regards, tom lane

Attachment Content-Type Size
all-valgrind-hacks.patch text/x-diff 24.3 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2021-03-18 21:57:13 Re: GROUP BY DISTINCT
Previous Message Euler Taveira 2021-03-18 21:44:53 Re: cleanup temporary files after crash