Re: Memory leak from ExecutorState context?

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Memory leak from ExecutorState context?
Date: 2023-03-01 17:48:40
Message-ID: 20230301184840.0a897a80@karst
Lists: pgsql-hackers

Hi Tomas,

On Tue, 28 Feb 2023 20:51:02 +0100
Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> > * HashBatchContext goes up to 1441MB after 240s then stay flat until the end
> > (400s as the last record)
>
> That's interesting. We're using HashBatchContext for very few things, so
> what could it consume so much memory? But e.g. the number of buckets
> should be limited by work_mem, so how could it get to 1.4GB?
>
> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> and print how many batches/buckets are there?

I did this test this morning.

Batches and buckets increased really quickly to 1048576/1048576.
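
The breakpoints were essentially just printing the hash table counters on each
call, roughly like this (a sketch, not the exact commands I used; the field
names are the ones from HashJoinTableData):

break ExecHashIncreaseNumBatches
commands
print hashtable->nbatch
cont
end

break ExecHashIncreaseNumBuckets
commands
print hashtable->nbuckets
print hashtable->nbuckets_optimal
cont
end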

ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands of
calls, all short-circuiting once nbatch reached 1048576, I guess because of the
«/* safety check to avoid overflow */» conditional block appearing early in this function.

I disabled the breakpoint on ExecHashIncreaseNumBatches a few times to make the
query run faster. After re-enabling it at 19.1GB of memory consumption, it stayed
silent until memory exhaustion (around 21 or 22GB, I don't remember exactly).

The breakpoint on ExecHashIncreaseNumBuckets triggered a few times at the
beginning, and one last time close to the end of the query execution.

> > Any idea? How could I help to have a better idea if a leak is actually
> > occurring and where exactly?
>
> Investigating memory leaks is tough, especially for generic memory
> contexts like ExecutorState :-( Even more so when you can't reproduce it
> on a machine with custom builds.
>
> What I'd try is this:
>
> 1) attach breakpoints to all returns in AllocSetAlloc(), printing the
> pointer and size for ExecutorState context, so something like
>
> break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
> commands
> print MemoryChunkGetPointer(chunk) size
> cont
> end
>
> 2) do the same for AllocSetFree()
>
> 3) Match the palloc/pfree calls (using the pointer address), to
> determine which ones are not freed and do some stats on the size.
> Usually there's only a couple distinct sizes that account for most of
> the leaked memory.

So here is what I ended up with this afternoon, using the files, line numbers and
macros from REL_11_18:

set logging on
set pagination off

break aset.c:781 if strcmp("ExecutorState",context.name) == 0
commands 1
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end

break aset.c:820 if strcmp("ExecutorState",context.name) == 0
commands 2
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end

break aset.c:979 if strcmp("ExecutorState",context.name) == 0
commands 3
print (((char *)(chunk)) + sizeof(struct AllocChunkData))
print chunk->size
cont
end

break AllocSetFree if strcmp("ExecutorState",context.name) == 0
commands 4
print pointer
cont
end
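
To match the alloc/free calls afterwards (your step 3), I post-process the gdb
log with a small script, roughly like the sketch below. This is not the exact
script I'm running (the regexps assume the default formatting of the
"Breakpoint N," and "$i = ..." lines, and the final stats are simplified), but
the idea is: remember the last pointer printed by breakpoints 1-3, forget it
when breakpoint 4 reports it freed, then aggregate whatever is left by chunk
size.

#!/usr/bin/env python3
# Rough sketch of the post-processing, not the exact script I'm running.
# Assumed input: the gdb log produced by the breakpoints above, where each
# hit is a "Breakpoint N, ..." line followed by the "$i = ..." values.
# Usage: python3 match_allocs.py gdb.txt
import re
import sys
from collections import Counter

bp_re = re.compile(r'^Breakpoint (\d+),')
val_re = re.compile(r'^\$\d+ = (?:\([^)]*\) )?(0x[0-9a-fA-F]+|\d+)')

live = {}          # pointer -> (breakpoint, alloc call number, size)
nalloc = 0         # number of AllocSetAlloc hits seen so far
cur_bp = None
pending = []       # values printed since the last "Breakpoint" line

def flush():
    """Record the alloc or free described by the lines collected so far."""
    global nalloc
    if cur_bp is None or not pending:
        return
    if cur_bp in ('1', '2', '3'):      # the three AllocSetAlloc returns
        nalloc += 1
        size = int(pending[1], 0) if len(pending) > 1 else 0
        live[pending[0]] = (cur_bp, nalloc, size)
    elif cur_bp == '4':                # AllocSetFree: the chunk is gone
        live.pop(pending[0], None)

with open(sys.argv[1]) as log:
    for line in log:
        m = bp_re.match(line)
        if m:
            flush()
            cur_bp, pending = m.group(1), []
            continue
        m = val_re.match(line)
        if m:
            pending.append(m.group(1))
    flush()

total = sum(size for _, _, size in live.values())
print("non-freed chunks: %d, total bytes: %d" % (len(live), total))
for size, n in Counter(s for _, _, s in live.values()).most_common(10):
    print("  size %20d: %6d chunks" % (size, n))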

So far, gdb has used more than 3h of CPU time and is eating 2.4GB of memory, while
the backend has accumulated only 3'32" of CPU time:

VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2727284 2.4g 17840 R 99.0 7.7 181:25.07 gdb
9054688 220648 103056 t 1.3 0.7 3:32.05 postmaster

Interestingly, the RES memory of the backend has not exploded yet, but VIRT is
already high.

I suppose the query will run for a few more hours; hopefully gdb will not
exhaust the memory in the meantime...

You'll find some intermediate stats I have already collected in the attachment:

* break 1, 2 and 3 are from AllocSetAlloc, break 4 is from AllocSetFree.
* most of the non-freed chunks were allocated at the very beginning, before
  the 5000th allocation call (out of almost 1M calls so far).
* 3754 of them have a chunk->size of 0
* it seems there are some buggy stats or data:
# this one actually really comes from the gdb log
0x38a77b8: break=3 num=191 sz=4711441762604810240 (weird sz)
# this one might be a bug in my script
0x2: break=2 num=945346 sz=2 (weird address)
* ignoring the weird size requested during the 191st call, the total amount
  of non-freed memory is currently 5488MB

I couldn't print "size" as it is optimized away, which is why I tracked
chunk->size instead... Is there anything wrong with my current run or gdb log?

The gdb log is 5MB compressed. I'll keep it off-list, but I can provide it if
needed.

Stay tuned...

Thank you!

Attachment Content-Type Size
allocs-tmp-stats.txt.gz application/gzip 52.5 KB
