Re: Memory leak from ExecutorState context?

From: Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Memory leak from ExecutorState context?
Date: 2023-03-01 23:18:27
Message-ID: 20230302001827.66e95dc3@karst
Lists: pgsql-hackers

Hi,

On Wed, 1 Mar 2023 20:29:11 +0100
Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> On 3/1/23 18:48, Jehan-Guillaume de Rorthais wrote:
> > On Tue, 28 Feb 2023 20:51:02 +0100
> > Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >> On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> >>> * HashBatchContext goes up to 1441MB after 240s then stay flat until the
> >>> end (400s as the last record)
> >>
> >> That's interesting. We're using HashBatchContext for very few things, so
> >> what could it consume so much memory? But e.g. the number of buckets
> >> should be limited by work_mem, so how could it get to 1.4GB?
> >>
> >> Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
> >> and print how many batches/buckets are there?
> >
> > I did this test this morning.
> >
> > Batches and buckets increased really quickly to 1048576/1048576.
>
> OK. I think 1M buckets is mostly expected for work_mem=64MB. It means
> buckets will use 8MB, which leaves ~56B per tuple (we're aiming for
> fillfactor 1.0).
>
> But 1M batches? I guess that could be problematic. It doesn't seem like
> much, but we need 1M files on each side - 1M for the hash table, 1M for
> the outer relation. That's 16MB of pointers, but the files are BufFile
> and we keep 8kB buffer for each of them. That's ~16GB right there :-(
>
> In practice it probably won't be that bad, because not all files will be
> allocated/opened concurrently (especially if this is due to many tuples
> having the same value). Assuming that's what's happening here, ofc.

And I suppose they are closed/freed concurrently as well?

> > ExecHashIncreaseNumBatches was really chatty, with hundreds of thousands
> > of calls, always short-circuited at 1048576, I guess because of the
> > conditional block «/* safety check to avoid overflow */» appearing early
> > in this function.
>
> Hmmm, that's a bit weird, no? I mean, the check is
>
> /* safety check to avoid overflow */
> if (oldnbatch > Min(INT_MAX / 2, MaxAllocSize / (sizeof(void *) * 2)))
> return;
>
> Why would it stop at 1048576? It certainly is not higher than INT_MAX/2
> and with MaxAllocSize = ~1GB the second value should be ~33M. So what's
> happening here?

Indeed, not the right suspect. But what about this other short-circuit then?

/* do nothing if we've decided to shut off growth */
if (!hashtable->growEnabled)
return;

[...]

/*
* If we dumped out either all or none of the tuples in the table, disable
* further expansion of nbatch. This situation implies that we have
* enough tuples of identical hashvalues to overflow spaceAllowed.
* Increasing nbatch will not fix it since there's no way to subdivide the
* group any more finely. We have to just gut it out and hope the server
* has enough RAM.
*/
if (nfreed == 0 || nfreed == ninmemory)
{
hashtable->growEnabled = false;
#ifdef HJDEBUG
printf("Hashjoin %p: disabling further increase of nbatch\n",
hashtable);
#endif
}

If I guess correctly, the function is not able to split the current batch, so
it sits and hopes. This is a much better suspect and I can surely track this
from gdb.

Knowing which fields are involved in the join could also help, to check or
gather some stats about them, but I haven't had time to dig into this yet...

[...]
> >> Investigating memory leaks is tough, especially for generic memory
> >> contexts like ExecutorState :-( Even more so when you can't reproduce it
> >> on a machine with custom builds.
> >>
> >> What I'd try is this:

[...]
> > I couldn't print "size" as it is optimized away, that's why I tracked
> > chunk->size... Is there anything wrong with my current run and gdb log?
>
> There's definitely something wrong. The size should not be 0, and
> neither it should be > 1GB. I suspect it's because some of the variables
> get optimized out, and gdb just uses some nonsense :-(
>
> I guess you'll need to debug the individual breakpoints, and see what's
> available. It probably depends on the compiler version, etc. For example
> I don't see the "chunk" for breakpoint 3, but "chunk_size" works and I
> can print the chunk pointer with a bit of arithmetics:
>
> p (block->freeptr - chunk_size)
>
> I suppose similar gymnastics could work for the other breakpoints.

OK, I'll give it a try tomorrow.
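To automate that arithmetic, maybe a gdb command file along these lines (a sketch only: the breakpoint location in aset.c is hypothetical, and which locals — here `set`, `block`, `chunk_size` — survive optimization depends on the build):

```gdb
# hypothetical breakpoint inside AllocSetAlloc, placed after block and
# chunk_size are computed; adjust the location to match your build
break aset.c:750
commands
  silent
  printf "ctx=%s chunk_size=%zu chunk=%p\n", \
         set->header.name, chunk_size, block->freeptr - chunk_size
  continue
end
```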

Thank you!

NB: the query has been killed by a replication conflict.
