Re: Memory leak (possibly connected to postgis) leading to server crash

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Roman Cervenak <roman(at)cervenak(dot)info>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Memory leak (possibly connected to postgis) leading to server crash
Date: 2019-12-06 12:41:19
Message-ID: 20191206124119.qe2xaffcbkcbmve3@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Dec 06, 2019 at 12:46:44PM +0100, Roman Cervenak wrote:
>Yes, it was killed by oom killer:
>
>[2037990.376427]
>oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/system-postgresql.slice,task=postgres,pid=52059,uid=111
>[2037990.376433] Out of memory: Kill process 52059 (postgres) score 294 or
>sacrifice child
>[2037990.384186] Killed process 52059 (postgres) total-vm:17508832kB,
>anon-rss:4309296kB, file-rss:108kB, shmem-rss:12641580kB
>[2037990.516504] oom_reaper: reaped process 52059 (postgres), now
>anon-rss:0kB, file-rss:0kB, shmem-rss:12641580kB
>
>(full dmesg.log attached, if it is interesting; there are more postgres
>backends visible, but they were inactive at the time)
>

OK, so it allocated extra ~4.3GB or so (plus shared buffers).

>I can try the gdb dump next time I will see it. But I cannot imagine giving
>you reproducible case - it is 500 GB proprietary database, and without it,
>queries would be hardly useful, I presume? I can try to make "sample" with
>generated data and find out if I can reproduce the issue with my queries
>that way, but that will be quite time consuming.
>

Yeah, I understand the data is proprietary. But we need to identify
where the issue is, somehow. And being able to reproduce it is likely
the best way to do that. One way to do that is to either generate
synthetic data with similar structure/features (geometries of similar
size etc.), and hope that it triggers the issue too. Or distill a subset
of data triggering the issue and anonymize it. Or something like that.

Yes, it's going to be quite time consuming, but I don't have better
ideas. Maybe running it under valgrind would help - you'd have to
rebuild PostgreSQL with valgrind support and run the workload on it. The
queries would be much slower (possibly by an order of magnitude or so),
so you'd have to run it on a different machine, but that's machine time,
not time wasted by a human.

Anyway, let's start with the easy stuff - try running it again and get
the memory context stats. Maybe it's a simple leak in PostgreSQL, in
which case it should be easier to investigate it. If that turns out to
be untrue, you can try this more complicated stuff (valgrind, ...).

>Would it help to dump the memory of the backend process and deliver the
>dump (by some private channel) to somebody to identify who is consuming all
>that memory? (that is the usual drill in windows)
>

I don't think that'd be very helpful - if this really is memory leak in
one of the libraries, I have no idea how to spot that in the memory
dump. Also, if you claim the data is sensitive/proprietary, you should
not really be sharing the dumps.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2019-12-06 15:14:16 Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)
Previous Message Roman Cervenak 2019-12-06 11:46:44 Re: Memory leak (possibly connected to postgis) leading to server crash