Re: Crash during backend start when low on memory

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: daniel(at)yesql(dot)se, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Crash during backend start when low on memory
Date: 2023-01-13 10:05:01
Message-ID: CA+14424mP36yJ=tUmJ-EBiQqUTc+7g6p_1+gBFymGoFxc-Lsow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Dec 16, 2022 at 9:46 AM Mats Kindahl <mats(at)timescale(dot)com> wrote:

> On Thu, Dec 15, 2022 at 6:20 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> wrote:
>
>> On 2022-Dec-14, Mats Kindahl wrote:
>>
>> > PostmasterContext is set quite early in the procedure so it should not
>> be a
>> > problem to use it. I think we can use PostmasterContext rather than
>> > TopMemoryContext. The memory for the allocations are passed down to
>> > PostmasterMain() and used inside InitPostgres() but as far as I can
>> tell,
>> > they are not used after the InitPostgres() call. The PostmasterContex is
>> > destroyed right after the call to InitPostgres() inside PostmasterMain()
>> > and then the rest is allocated in other contexts.
>>
>> Well, this is what PostgresMain has to say about it:
>>
>> /*
>> * If the PostmasterContext is still around, recycle the space;
>> we don't
>> * need it anymore after InitPostgres completes. Note this does
>> not trash
>> * *MyProcPort, because ConnCreate() allocated that space with
>> malloc()
>> * ... else we'd need to copy the Port data first. Also,
>> subsidiary data
>> * such as the username isn't lost either; see
>> ProcessStartupPacket().
>> */
>> if (PostmasterContext)
>> {
>> MemoryContextDelete(PostmasterContext);
>>
>>
> Ugh. Ok. Missed that comment.
>
>
>> So I think blowing away part of struct Port independently of the
>> containing struct is a bad idea. Why isn't more appropriate to do what
>> ConnCreate does?
>>
>> if (!(port = (Port *) calloc(1, sizeof(Port))))
>> {
>> ereport(LOG,
>> (errcode(ERRCODE_OUT_OF_MEMORY),
>> errmsg("out of memory")));
>> ExitPostmaster(1);
>> }
>>
>> ... ugh. I have to admit that killing postmaster due to a transient
>> out-of-memory makes me a bit nervous -- do note that this
>> ExitPostmaster() happens before we've forked the backend, so this code
>> does bring the server down (!!). If we want to make postmaster more
>> robust in the face of OOM, we should also change this code somehow, yes?
>>
>
> Seems reasonable, yes. I'll see what I can throw together.
>
>
>>
>> I agree with Mats' premise when starting the thread -- even if the
>> server is not in great shape, it doesn't seem like taking the whole
>> service down is a great response. It would be better to just refuse the
>> connection with some error and let existing connections chug along,
>> poorly though they can. Maybe they'll also die soon, freeing memory;
>> but that'll be no good if Postmaster is gone altogether.
>>
>
>> Now, it would be better if we could test this more thoroughly. Perhaps
>> it's possible to use something like the mallocfail.c lib that Heikki
>> wrote a few years ago, teaching it to handle calloc
>> https://postgr.es/m/547480DE.4040408@vmware.com
>
>
> Thanks. Will take a look at that.
>

I have an improved patch based on raising an error when malloc fails. I
used the version of mallocfail that is available at
https://github.com/ralight/mallocfail which allows you to run the program
with incrementally failing memory allocations but since the count is around
6886 or so when the postmaster starts to allow backend starts and other
processes are starting in between, this count is not always accurate and
something "smarter" is needed to get a reliable failure test.

It is quite straightforward to test it by attaching a debugger and setting
the return value of malloc, but to be able to have some "smarter" failures
only when, for example, starting a backend it would be necessary to have
something better integrated with the postgres code. Being able to attach a
uprobe using bpftrace and override the return value would be the most
elegant solution, but that is currently not possible, so considering
whether to add something like the mallocfail above to the regular testing.

Note: there are a bunch of other core dumps that occur during startup
before the postmaster is ready to accept connections, but I think that is
less of a problem since it fails during startup before the system is ready
to accept connections.

Best wishes,
Mats Kindahl, Timescale

>
>>
>> --
>> Álvaro Herrera Breisgau, Deutschland —
>> https://www.EnterpriseDB.com/
>> "Oh, great altar of passive entertainment, bestow upon me thy discordant
>> images
>> at such speed as to render linear thought impossible" (Calvin a la TV)
>>
>

Attachment Content-Type Size
0001-Check-return-value-of-strdup-when-starting-backend.patch application/x-patch 6.3 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message wangw.fnst@fujitsu.com 2023-01-13 11:17:42 RE: Logical Replica ReorderBuffer Size Accounting Issues
Previous Message Kyotaro Horiguchi 2023-01-13 09:36:05 Re: BUG #17744: Fail Assert while recoverying from pg_basebackup