Re: Missed check for too-many-children in bgworker spawning

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Missed check for too-many-children in bgworker spawning
Date: 2019-10-09 22:26:58
Message-ID: 16897.1570660018@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Oct 9, 2019 at 10:21 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> We could improve on matters so far as the postmaster's child-process
>> arrays are concerned, by defining separate slot "pools" for the different
>> types of child processes. But I don't see much point if the code is
>> not prepared to recover from a fork() failure --- and if it is, that
>> would a fortiori deal with out-of-child-slots as well.

> I would say rather that if fork() is failing on your system, you have
> a not very stable system. The fact that parallel query is going to
> fail is sad, but not as sad as the fact that connecting to the
> database is also going to fail, and that logging into the system to
> try to fix the problem may well fail as well.

True, it's not a situation you especially want to be in. However,
I've lost count of the number of times that I've heard someone talk
about how their system was overstressed to the point that everything
else was failing, but Postgres kept chugging along. That's a good
reputation to have and we shouldn't just walk away from it.

> Code that tries to make
> parallel query cope with this situation without an error wouldn't
> often be tested, so it might be buggy, and it wouldn't necessarily be
> a benefit if it did work. I expect many people would rather have the
> query fail and free up slots in the system process table than consume
> precisely all of them and then try to execute the query at a
> slower-than-expected rate.

I find that argument to be utter bunkum. The parallel query code is
*already* designed to silently degrade performance when its primary
resource limit (shared bgworker slots) is exhausted. How can it be
all right to do that but not all right to cope with fork failure
similarly? If we think running up against the kernel limits is a
case that we can roll over and die on, why don't we rip out the
virtual-FD stuff in fd.c?
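
To make the comparison concrete, here is a minimal sketch of the
degradation pattern the parallel-query code already relies on (this
is illustrative, not the actual parallel.c code, though the entry
point and registration call are the standard bgworker ones): ask for
N workers, let RegisterDynamicBackgroundWorker() say no once the
shared slots are gone, and run the query with however many you got.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    /* Illustrative: register up to nworkers, tolerate getting fewer. */
    static int
    launch_workers_best_effort(int nworkers, BackgroundWorkerHandle **handles)
    {
        BackgroundWorker worker;
        int         launched = 0;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
        worker.bgw_start_time = BgWorkerStart_ConsistentState;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "ParallelWorkerMain");
        snprintf(worker.bgw_name, BGW_MAXLEN, "parallel worker");
        worker.bgw_notify_pid = MyProcPid;

        for (int i = 0; i < nworkers; i++)
        {
            /* "false" just means no free slot --- not an ERROR */
            if (!RegisterDynamicBackgroundWorker(&worker, &handles[i]))
                break;
            launched++;
        }
        return launched;        /* execute the plan with this many */
    }

fd.c behaves much the same way at its limit: when open() starts
failing with EMFILE, it closes an already-open virtual FD and retries
rather than giving up on the query.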

As for "might be buggy", if we ripped out every part of Postgres
that's under-tested, I'm afraid there might not be much left.
In any case, a sane design for this would make as much as possible
of the code handle "out of shared bgworker slots" just the same as
resource failures later on, so that there wouldn't be that big a gap
in coverage.
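
(Concretely, "handle it the same" could look something like the
sketch below --- the helper name is invented here, not a claim about
how parallel.c is actually factored: fold "never got a slot" and
"got a slot but fork or startup failed" into one answer, so the
fewer-workers path gets exercised no matter where the shortage
appears.)

    /*
     * Sketch: count workers that actually came up.  A NULL handle means
     * registration found no free slot; a non-STARTED status means the
     * postmaster could not fork or the worker died during startup.
     */
    static int
    count_usable_workers(BackgroundWorkerHandle **handles, int nregistered)
    {
        int         usable = 0;

        for (int i = 0; i < nregistered; i++)
        {
            pid_t       pid;

            if (handles[i] == NULL)
                continue;
            if (WaitForBackgroundWorkerStartup(handles[i], &pid) != BGWH_STARTED)
                continue;
            usable++;
        }
        return usable;          /* scale plan execution to this number */
    }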

Having said all that, I made a patch that causes the postmaster
to reserve separate child-process-array slots for autovac workers
and bgworkers, as per attached, so that excessive connection
requests can't DOS those subsystems. But I'm not sure that it's
worth the complication; it wouldn't be necessary if the parallel
query launch code were more robust.
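
The shape of it is roughly as follows --- the identifiers here are
loosely modeled on postmaster.c's existing ones but are illustrative,
not copied from the attached patch: stop letting ordinary connection
requests draw down the slots that autovac workers and bgworkers need.

    /* Hypothetical sketch only; see the attached patch for the real thing. */
    static bool
    child_slot_available(int backend_type)
    {
        switch (backend_type)
        {
            case BACKEND_TYPE_AUTOVAC:
                return CountChildren(BACKEND_TYPE_AUTOVAC) < autovacuum_max_workers;
            case BACKEND_TYPE_BGWORKER:
                return CountChildren(BACKEND_TYPE_BGWORKER) < max_worker_processes;
            default:
                /* normal backends cannot eat into the reserved pools */
                return CountChildren(BACKEND_TYPE_NORMAL) < MaxConnections;
        }
    }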

regards, tom lane

Attachment: separate-limits-for-different-process-types-1.patch (text/x-diff, 5.2 KB)
