Re: Memory leak from ExecutorState context?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Memory leak from ExecutorState context?
Date: 2023-03-02 12:44:52
Message-ID: 77a96d42-00cb-2448-465a-aa1e92d00cac@enterprisedb.com
Lists: pgsql-hackers


On 3/2/23 13:08, Jehan-Guillaume de Rorthais wrote:
> [...]
>> But I have another idea - put a breakpoint on makeBufFile() which is the
>> bit that allocates the temp files including the 8kB buffer, and print in
>> what context we allocate that. I have a hunch we may be allocating it in
>> the ExecutorState. That'd explain all the symptoms.
>
> That's what I was wondering as well last night.
>
> So, on your advice, I set a breakpoint on makeBufFile:
>
> (gdb) info br
> Num Type Disp Enb Address What
> 1 breakpoint keep y 0x00000000007229df in makeBufFile
> bt 10
> p CurrentMemoryContext.name
>
>
> Then, I disabled it and ran the query up to this mem usage:
>
> VIRT RES SHR S %CPU %MEM
> 20.1g 7.0g 88504 t 0.0 22.5
>
> Then, I enabled the breakpoint and looked at around 600 backtraces and
> context names before getting bored. They **all** looked like this:
>
> Breakpoint 1, BufFileCreateTemp (...) at buffile.c:201
> 201 in buffile.c
> #0 BufFileCreateTemp (...) buffile.c:201
> #1 ExecHashJoinSaveTuple (tuple=0x1952c180, ...) nodeHashjoin.c:1238
> #2 ExecHashJoinImpl (parallel=false, pstate=0x31a6418) nodeHashjoin.c:398
> #3 ExecHashJoin (pstate=0x31a6418) nodeHashjoin.c:584
> #4 ExecProcNodeInstr (node=<optimized out>) execProcnode.c:462
> #5 ExecProcNode (node=0x31a6418)
> #6 ExecSort (pstate=0x31a6308)
> #7 ExecProcNodeInstr (node=<optimized out>)
> #8 ExecProcNode (node=0x31a6308)
> #9 fetch_input_tuple (aggstate=aggstate(at)entry=0x31a5ea0)
>
> $421643 = 0x99d7f7 "ExecutorState"
>
> These 600-ish 8kB buffers were all allocated in "ExecutorState". I could
> probably log many more of them if more checks/stats need to be collected, but
> it really slows the query down a lot, leaving it only 1-5% of CPU time instead
> of the usual 100%.
>

Bingo!

> So it's not exactly a leak, as the memory would be released at the end of the
> query, but I suppose these buffers should be allocated in a shorter-lived
> context, to avoid this memory bloat, am I right?
>

Well, yes and no.

In principle we could/should have allocated the BufFiles in a different
context (possibly hashCxt). But in practice it probably wouldn't make
much difference, because the query will likely run all the hashjoins at
the same time anyway. Imagine a sequence of joins - we build all the
hashes first, then tuples from the outer side bubble up through the
plans, and only after processing the last tuple do we release all the
hashes.
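
For illustration, the change would be something like this in
ExecHashJoinSaveTuple (just a sketch, and it assumes we'd pass the
hashtable down so the function can get at its hashCxt):

    /*
     * Sketch only: charge the BufFile (and its 8kB buffer) to the
     * hashjoin's own context instead of the per-query ExecutorState.
     */
    MemoryContext oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);

    if (*fileptr == NULL)
        *fileptr = BufFileCreateTemp(false);

    MemoryContextSwitchTo(oldcxt);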

So this alone would not fix the issue. It'd be helpful for accounting
purposes (we'd know the memory goes to BufFiles, and perhaps for which
hashjoin node), but we'd still have to allocate the same amount of
memory (so still OOM).

There's only one thing I think could help - increasing work_mem enough
not to trigger the explosive growth in the number of batches. Imagine
there's one very common value, accounting for ~65MB of tuples. With
work_mem=64MB this leads to exactly the explosive growth you're
observing here. With 128MB it'd probably run just fine.

The problem is we don't know in advance how large work_mem would need to
be :-( So you'll have to experiment a bit.

I remembered there was a thread [1] about *exactly* this issue in 2019.

[1]
https://www.postgresql.org/message-id/flat/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net

I even posted a couple of patches trying to address this by accounting
for the BufFile memory and increasing work_mem a bit, instead of just
blindly increasing the number of batches (which ignores the fact that
more batches means more memory used for the BufFiles).
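
IIRC the core idea was something like this in the nbatch growth path (a
rough sketch of the idea only, not what the patches actually did):

    /*
     * Rough sketch: each batch needs a BufFile with an 8kB buffer on
     * both the inner and outer side, so doubling nbatch doubles that
     * overhead. If the BufFiles already cost more than the in-memory
     * hash table is allowed to use, growing the budget is cheaper
     * than growing the number of batches.
     */
    Size    buffile_mem = 2 * (Size) hashtable->nbatch * 8192;

    if (buffile_mem > hashtable->spaceAllowed)
        hashtable->spaceAllowed *= 2;           /* grow the budget */
    else
        ExecHashIncreaseNumBatches(hashtable);  /* usual doubling */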

I don't recall why it went nowhere, TBH. But I recall there were
discussions at the time about maybe doing something like "block
nestloop" instead. Or maybe the thread just went cold.

>> BTW with how many batches does the hash join start?
>
> * batches went from 32 to 1048576 before growEnabled was set to false, as
> suspected
> * original and current nbuckets were set to 1048576 immediately
> * allowed space is set to work_mem, but current space usage is 1.3GB, as
> measured previously, just before the system refused further memory allocations
>

Yeah, I think this is pretty much expected. We start with multiple
batches, so we pick the optimal number of buckets for the whole
work_mem right away (hence no bucket growth).

But then the number of batches explodes, in the futile hope of keeping
the data within work_mem. Once that growth gets disabled, we end up
with a 1.3GB hash table.
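
FWIW that's also consistent with the process size you reported. Assuming
most of the batch files actually got created, the 8kB buffers alone come
to roughly

    2 (sides) * 1048576 (batches) * 8kB = ~16GB

which, together with the 1.3GB hash table, is in the same ballpark as
the ~20GB of virtual memory you saw.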

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
