Re: Memory leak from ExecutorState context?

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Memory leak from ExecutorState context?
Date: 2023-03-28 15:25:49
Message-ID: f532d33d-e4c1-ed00-e047-94f3aafa49db@enterprisedb.com
Lists: pgsql-hackers

On 3/28/23 15:17, Jehan-Guillaume de Rorthais wrote:
> On Tue, 28 Mar 2023 00:43:34 +0200
> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
>> On 3/27/23 23:13, Jehan-Guillaume de Rorthais wrote:
>>> Please find in attachment a patch to allocate bufFiles in a dedicated
>>> context. I picked up your patch, backpatch'd it, went through it and did
>>> some minor changes to it. I have some comments/questions though.
>>>
>>> 1. I'm not sure why we must allocate the "HashBatchFiles" new context
>>> under ExecutorState and not under hashtable->hashCxt?
>>>
>>> The only references I could find were in hashjoin.h:30:
>>>
>>> /* [...]
>>> * [...] (Exception: data associated with the temp files lives in the
>>> * per-query context too, since we always call buffile.c in that
>>> * context.)
>>>
>>> And in nodeHashjoin.c:1243:ExecHashJoinSaveTuple() (I reworded this
>>> original comment in the patch):
>>>
>>> /* [...]
>>> * Note: it is important always to call this in the regular executor
>>> * context, not in a shorter-lived context; else the temp file buffers
>>> * will get messed up.
>>>
>>>
>>> But these are not explanation of why BufFile related allocations must be
>>> under a per-query context.
>>>
>>
>> Doesn't that simply describe the current (unpatched) behavior where
>> BufFile is allocated in the per-query context?
>
> I wasn't sure. The first quote from hashjoin.h seems to describe a stronger
> rule about «**always** call buffile.c in per-query context». But maybe it ought
> to be «always call buffile.c from one of the sub-query contexts»? I assume the
> aim is to enforce temp file removal at query end/error?
>

I don't think we need this info for tempfile cleanup - CleanupTempFiles
relies on the VfdCache, which does malloc/realloc (so it's not using
memory contexts at all).

I'm not very familiar with this part of the code, so I might be missing
something. But you can try that - just stick an elog(ERROR) somewhere
into the hashjoin code, and see if that breaks the cleanup.

Not an explicit proof, but if there was some hard-wired requirement in
which memory context to allocate BufFile stuff, I'd expect it to be
documented in buffile.c. But that actually says this:

* Note that BufFile structs are allocated with palloc(), and therefore
* will go away automatically at query/transaction end. Since the underlying
* virtual Files are made with OpenTemporaryFile, all resources for
* the file are certain to be cleaned up even if processing is aborted
* by ereport(ERROR). The data structures required are made in the
* palloc context that was current when the BufFile was created, and
* any external resources such as temp files are owned by the ResourceOwner
* that was current at that time.

which I take as confirmation that it's legal to allocate BufFile in any
memory context, and that cleanup is handled by the cache in fd.c.

>> I mean, the current code calls BufFileCreateTemp() without switching the
>> context, so it's in the ExecutorState. But with the patch it very clearly is
>> not.
>>
>> And I'm pretty sure the patch should do
>>
>>    hashtable->fileCxt = AllocSetContextCreate(hashtable->hashCxt,
>>                                               "HashBatchFiles",
>>                                               ALLOCSET_DEFAULT_SIZES);
>>
>> and it'd still work. Or why do you think we *must* allocate it under
>> ExecutorState?
>
> That was actually my very first patch and it indeed worked. But I was confused
> by the previously quoted code comments. That's why I kept your original code
> and decided to raise the discussion here.
>

IIRC I was just lazy when writing the experimental patch, there was not
much thought about stuff like this.

> Fixed in new patch in attachment.
>
>> FWIW The comment in hashjoin.h needs updating to reflect the change.
>
> Done in the last patch. Is my rewording accurate?
>
>>> 2. Wrapping each call of ExecHashJoinSaveTuple() with a memory context
>>> switch seems fragile as it could be forgotten in future code paths/changes.
>>> So I added an Assert() in the function to make sure the current memory
>>> context is "HashBatchFiles" as expected.
>>> Another way to tie this up might be to pass the memory context as
>>> argument to the function.
>>> ... Or maybe I'm over precautionary.
>>>
>>
>> I'm not sure I'd call that fragile, we have plenty of other code that
>> expects the memory context to be set correctly. Not sure about the
>> assert, but we don't have similar asserts anywhere else.
>
> I mostly stuck it there to stimulate the discussion around this, as I needed
> to scratch that itch.
>
>> But I think it's just ugly and overly verbose
>
> +1
>
> Your patch was just a demo/debug patch at the time. It needed some cleanup now
> :)
>
>> it'd be much nicer to e.g. pass the memory context as a parameter, and do
>> the switch inside.
>
> That was a proposition in my previous mail, so I did it in the new patch. Let's
> see what other reviewers think.
>

+1

>>> 3. You wrote:
>>>
>>>>> A separate BufFile memory context helps, although people won't see it
>>>>> unless they attach a debugger, I think. Better than nothing, but I was
>>>>> wondering if we could maybe print some warnings when the number of batch
>>>>> files gets too high ...
>>>
>>> So I added a WARNING when batches memory are exhausting the memory size
>>> allowed.
>>>
>>> +   if (hashtable->fileCxt->mem_allocated > hashtable->spaceAllowed)
>>> +       elog(WARNING, "Growing number of hash batch is exhausting memory");
>>>
>>> This is repeated on each call of ExecHashIncreaseNumBatches when BufFile
>>> overflows the memory budget. I realize now I should probably add the
>>> memory limit, the current number of batches and their memory consumption.
>>> The message is probably too cryptic for a user. It could probably be
>>> reworded, but some doc or additional hints around this message might help.
>>>
>>
>> Hmmm, not sure if WARNING is a good approach, but I don't have a better
>> idea at the moment.
>
> I stepped it down to NOTICE and added some more infos.
>
> Here is the output of the last patch with a 1MB work_mem:
>
> =# explain analyze select * from small join large using (id);
> WARNING: increasing number of batches from 1 to 2
> WARNING: increasing number of batches from 2 to 4
> WARNING: increasing number of batches from 4 to 8
> WARNING: increasing number of batches from 8 to 16
> WARNING: increasing number of batches from 16 to 32
> WARNING: increasing number of batches from 32 to 64
> WARNING: increasing number of batches from 64 to 128
> WARNING: increasing number of batches from 128 to 256
> WARNING: increasing number of batches from 256 to 512
> NOTICE: Growing number of hash batch to 512 is exhausting allowed memory
> (2164736 > 2097152)
> WARNING: increasing number of batches from 512 to 1024
> NOTICE: Growing number of hash batch to 1024 is exhausting allowed memory
> (4329472 > 2097152)
> WARNING: increasing number of batches from 1024 to 2048
> NOTICE: Growing number of hash batch to 2048 is exhausting allowed memory
> (8626304 > 2097152)
> WARNING: increasing number of batches from 2048 to 4096
> NOTICE: Growing number of hash batch to 4096 is exhausting allowed memory
> (17252480 > 2097152)
> WARNING: increasing number of batches from 4096 to 8192
> NOTICE: Growing number of hash batch to 8192 is exhausting allowed memory
> (34504832 > 2097152)
> WARNING: increasing number of batches from 8192 to 16384
> NOTICE: Growing number of hash batch to 16384 is exhausting allowed memory
> (68747392 > 2097152)
> WARNING: increasing number of batches from 16384 to 32768
> NOTICE: Growing number of hash batch to 32768 is exhausting allowed memory
> (137494656 > 2097152)
>
> QUERY PLAN
> --------------------------------------------------------------------------
> Hash Join (cost=6542057.16..7834651.23 rows=7 width=74)
> (actual time=558502.127..724007.708 rows=7040 loops=1)
> Hash Cond: (small.id = large.id)
> -> Seq Scan on small (cost=0.00..940094.00 rows=94000000 width=41)
> (actual time=0.035..3.666 rows=10000 loops=1)
> -> Hash (cost=6542057.07..6542057.07 rows=7 width=41)
> (actual time=558184.152..558184.153 rows=700000000 loops=1)
> Buckets: 32768 (originally 1024)
> Batches: 32768 (originally 1)
> Memory Usage: 1921kB
> -> Seq Scan on large (cost=0.00..6542057.07 rows=7 width=41)
> (actual time=0.324..193750.567 rows=700000000 loops=1)
> Planning Time: 1.588 ms
> Execution Time: 724011.074 ms
> (8 rows)
>
> Regards,

OK, although with NOTICE that may actually make it less useful - the default
level is WARNING, and regular users are unable to change the level. So
very few people will actually see these messages.

thanks

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
