Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: terminate called after throwing an instance of 'std::bad_alloc'
Date: 2020-12-18 23:56:07
Message-ID: 20201218235607.GC30237@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'm re-raising this issue (since I needed to track down why I'd reflexively
saved a 512g backup of the filesystem where this first happened).

I'm 99% sure the "bad_alloc" is from LLVM. It happened multiple times on
different servers (running a similar report) after setting jit=on during pg13
upgrade, and never happened since re-setting jit=off.

I reproduced a memory leak, but I don't think you publicized a fix for it.

I *think* the memory leak probably explains the bad_alloc. We were seeing
postgres killed by OOM every ~16hours, although I don't see how that's possible
from such a tiny leak. Our query also uses postgis (at first I said it
didn't), so it may be that they're involved.

I'd be happy to run with a prototype fix for the leak to see if the other issue
does (not) recur.

On Wed, Oct 07, 2020 at 04:02:42PM -0700, Andres Freund wrote:
> Hi,
>
> Sorry for the second send of this email, but somehow I managed to mangle
> the headers in the last version I sent.
>
> On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> > On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > > I was able to make a small, apparent leak like so. This applies to
> > > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > > >
> > > > > Inlining is essential to see the leak, optimization is not. Showing every 9th
> > > > > query:
> > > > >
> > > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > > ! 46024 kB max resident size
> > > > > ! 46492 kB max resident size
> > > >
> > > > Huh, that's certainly not good. Looking.
> > >
> > > Reproduce like:
> > > postgres=# CREATE EXTENSION postgis;
> > > postgres=# CREATE TABLE geo(g geometry(point));
> > > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
> >
> > It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically
> > whenever we load an IR module to inline from, it renames its type names
> > to make them not conflict with already known type names in the current
> > LLVMContext. That, over time, leads to a large number of type names in
> > memory. We can avoid the re-loading of the module by instead cloning it,
> > but that still leads to new types being created for some internal
> > reasons. It's possible to create/drop separate LLVMContext instances,
> > but we'd have to figure out when to do so - it's too expensive to always
> > do that.
>
> I spent a lot of time (probably too much time) trying to find a good way
> out of that. I think I mostly figured it out, but it's gonna take me a
> bit longer to nail down the loose ends. The nice part is that it
> actually speeds up inlining a lot. The bad part is that it's not a small
> change, so I don't yet know how to deal with existing releases.
>
> Details:
>
> The issue with the leaks is basically that when LLVM loads a bitcode
> module it doesn't "unify" named types with types in other bitcode module
> - they could after all be entirely independent. That means that we end
> up with a lot of types after a while of running.
>
> That ought to be addressable by not re-loading modules we inline from
> from disk every time we inline, but unfortunately that doesn't work well
> either: The "cross module import" logic unfortunately modifies the types
> used in the "source module", leading to subsequent imports not working
> well. That last issue is why I had moved to re-loading modules from
> "disk" during import.
>
> The partial patch I've written doesn't use the existing cross module
> import code anymore, but moves to the lower level functionality it
> uses. For traditional compilers, which is what that linker is mainly
> written for, it makes sense to reduce memory usage by "destructively"
> moving code from the source into the target - it'll only be used
> once. But for the JIT case it's better to do so non-destructively. That
> requires us to implement some "type resolution" between the source and
> target modules, but then gains not needing to reload / copy the source
> module anymore, saving a lot of the costs.
>
>
> It's possible that for existing releases we ought to go for a simpler
> approach, even if it's not as good. The only really feasible way I see
> is to basically count the number of times inlining has been done, and
> then just re-initialize after a certain time. But that's pretty darn
> yucky.
>
> Greetings,
>
> Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-12-19 00:35:57 Re: Refactoring HMAC in the core code
Previous Message Tom Lane 2020-12-18 22:14:31 Re: [PATCH] Implements SPIN_LOCK on ARM