Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: terminate called after throwing an instance of 'std::bad_alloc'
Date: 2021-04-18 00:13:24
Message-ID: 20210418001324.GP3315@telsasoft.com
Lists: pgsql-hackers

On Fri, Apr 16, 2021 at 10:18:37PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the other issue
> > > > does (not) recur.
> > >
> > > I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > > (just because that was the first thread I re-found). It'd be cool if you
> > > could have a look!
> >
> > This doesn't seem to address the problem triggered by the reproducer at
> > https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> > (sorry I didn't CC you)
>
> I take that back - I forgot that this doesn't release RAM until hitting a
> threshold.

I tried this on the query that was causing the original C++ exception.

It still grows to 2GB within 5min.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23084 postgres 20 0 2514364 1.6g 29484 R 99.7 18.2 3:40.87 postgres: telsasoft ts 192.168.122.11(50892) SELECT

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23084 postgres 20 0 3046960 2.1g 29484 R 100.0 24.1 4:30.64 postgres: telsasoft ts 192.168.122.11(50892) SELECT

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23084 postgres 20 0 4323500 3.3g 29488 R 99.7 38.4 8:20.63 postgres: telsasoft ts 192.168.122.11(50892) SELECT
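
To put a number on the growth, here is a minimal sketch that turns the three top samples above into a rate. It assumes top's default units (RES in KiB unless suffixed with m or g, TIME+ as minutes:seconds.hundredths); the function names are made up for illustration, and RES is only shown to 0.1g, so the result is approximate.

```python
import re

# Samples copied from the top output above for PID 23084: (TIME+, RES).
SAMPLES = [("3:40.87", "1.6g"), ("4:30.64", "2.1g"), ("8:20.63", "3.3g")]


def cpu_seconds(time_plus):
    """Convert top's TIME+ field (minutes:seconds.hundredths) to seconds."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) * 60 + float(seconds)


def res_gib(res):
    """Convert top's RES field to GiB (bare number = KiB, m = MiB, g = GiB)."""
    match = re.fullmatch(r"([0-9.]+)([mg]?)", res)
    if match is None:
        raise ValueError(f"unrecognized RES value: {res!r}")
    value, suffix = float(match.group(1)), match.group(2)
    return {"": value / 1024**2, "m": value / 1024, "g": value}[suffix]


def growth_rates(samples):
    """Return RES growth in GiB per CPU-minute between consecutive samples."""
    points = [(cpu_seconds(t), res_gib(r)) for t, r in samples]
    return [
        (r1 - r0) / (t1 - t0) * 60
        for (t0, r0), (t1, r1) in zip(points, points[1:])
    ]


RATES = growth_rates(SAMPLES)
for rate in RATES:
    print(f"{rate:.2f} GiB per CPU-minute")
```

Both intervals come out as sustained growth, roughly 0.6 and then 0.3 GiB per CPU-minute, all within the one SELECT.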

When I first reported this issue, the affected process was a long-running,
single-threaded python tool. We have since updated it (partially to avoid issues
like this) to use multiprocessing, and therefore separate postgres backends.

I'm now realizing that that's RAM use for a single query, not from continuous
leaks across multiple queries. This is still true with the patch even if I
#define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28438 postgres 20 0 3854264 2.8g 29428 R 98.7 33.2 8:56.79 postgres: telsasoft ts 192.168.122.11(53614) BIND

python3 ./jitleak.py # runs telsasoft reports
INFO: recreating LLVM context after 2 uses
INFO: recreating LLVM context after 2 uses
INFO: recreating LLVM context after 2 uses
INFO: recreating LLVM context after 2 uses
INFO: recreating LLVM context after 2 uses
PID 27742 finished running report; est=None rows=40745; cols=34; ... duration:538
INFO: recreating LLVM context after 81492 uses

I did:

-       llvm_llvm_context_reuse_count = 0;
        Assert(llvm_context != NULL);
+       elog(INFO, "recreating LLVM context after %zu uses", llvm_llvm_context_reuse_count);
+       llvm_llvm_context_reuse_count = 0;

Maybe we're missing this condition somehow?
if (llvm_jit_context_in_use_count == 0 &&
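
To make that hypothesis concrete, here is a toy model in Python. It is not the patch's code: the class, its methods, and the idea that one query keeps an outer context in use while creating many more are all assumptions for illustration. It only shows that gating recreation on an in-use count of zero plus a reuse threshold can produce a log shaped like the one above, with "after 2 uses" between short queries and one large count once a long query ends.

```python
class ToyJitState:
    """Hypothetical stand-in for the backend's LLVM context bookkeeping."""

    def __init__(self, reuse_max):
        self.reuse_max = reuse_max  # plays the role of LLVMJIT_LLVM_CONTEXT_REUSE_MAX
        self.in_use = 0             # JIT contexts currently alive
        self.reuse_count = 0        # uses since the LLVM context was last recreated
        self.log = []

    def create_context(self):
        # Assumed gate: recreate only when nothing is in use AND the
        # reuse threshold has been exceeded.
        if self.in_use == 0 and self.reuse_count > self.reuse_max:
            self.log.append(
                f"recreating LLVM context after {self.reuse_count} uses"
            )
            self.reuse_count = 0
        self.in_use += 1
        self.reuse_count += 1

    def release_context(self):
        self.in_use -= 1


def run_short_query(state):
    """A query that creates and releases a single JIT context."""
    state.create_context()
    state.release_context()


def run_long_query(state, inner_uses):
    """A query that holds one context while creating many others (assumed)."""
    state.create_context()          # stays in use for the whole query
    for _ in range(inner_uses):
        state.create_context()      # in_use > 0, so recreation is skipped
        state.release_context()
    state.release_context()


state = ToyJitState(reuse_max=1)
for _ in range(4):
    run_short_query(state)
run_long_query(state, inner_uses=1000)
run_short_query(state)              # first chance to recreate after the long query
print("\n".join(state.log))
```

In this model the memory tied to the context cannot be released until the long query finishes, regardless of how low the threshold is set.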

Also, I just hit this assertion by cancelling the query with ^C / SIGINT. But
I don't have a reproducer for it.

< 2021-04-17 19:14:23.509 ADT telsasoft >PANIC: LLVMJitContext in use count not 0 at exit (is 1)

--
Justin
