Reducing sema usage (was Postmaster dies with many child processes)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Reducing sema usage (was Postmaster dies with many child processes)
Date: 1999-01-31 00:11:54
Message-ID: 5737.917741514@sss.pgh.pa.us
Lists: pgsql-hackers

I said:
> Another thing we ought to look at is changing the use of semaphores so
> that Postgres uses a fixed number of semaphores, not a number that
> increases as more and more backends are started. Kernels are
> traditionally configured with very low limits for the SysV IPC
> resources, so having a big appetite for semaphores is a Bad Thing.

I've been looking into this issue today, and it looks possible but messy.

The source of the problem is the lock manager
(src/backend/storage/lmgr/proc.c), which wants to be able to wake up a
specific process that is blocked on a lock. I had first thought that it
would be OK to wake up any one of the processes waiting for a lock, but
after looking at the lock manager that seems a bad idea --- considerable
thought has gone into the queuing order of waiting processes, and we
don't want to give that up. So we need to preserve this ability.

The way it's currently done is that each extant backend has its own
SysV-style semaphore, and when you want to wake up a particular backend
you just V() its semaphore. (BTW, the semaphores get allocated in
chunks of 16, so an out-of-semaphores condition will always occur when
trying to start the 16*N+1'th backend...) This is simple and reliable
but fails if you want to have more backends than the kernel has SysV
semaphores. Unfortunately kernels are usually configured with not
very many semaphores --- 64 or so is typical. Also, running the system
down to nearly zero free semaphores is likely to cause problems for
other subsystems even if Postgres itself doesn't run out.

What seems practical to do instead is this:
* At postmaster startup, allocate a fixed number of semaphores for
use by all child backends. ("Fixed" can really mean "configurable",
of course, but the point is we won't ask for more later.)
* The semaphores aren't dedicated to use by particular backends.
Rather, when a backend needs to block, it finds a currently free
semaphore and grabs it for the duration of its wait. The number
of the semaphore a backend is using to wait with would be recorded
in its PROC struct, and we'd also need an array of per-sema data
to keep track of free and in-use semaphores.
* This works with very little extra overhead until we have more
simultaneously-blocked backends than we have semaphores. When that
happens (which we hope is really seldom), we overload semaphores ---
that is, we use the same sema to block two or more backends. Then
the V() operation by the lock's releaser might wake the wrong backend.
So, we need an extra field in the LOCK struct to identify the intended
wake-ee. When a backend is released in ProcSleep, it has to look at
the lock it is waiting on to see if it is supposed to be wakened
right now. If not, it V()s its shared semaphore a second time (to
release the intended wakee), then P()s the semaphore again to go
back to sleep itself. There probably has to be a delay in here,
to ensure that the intended wakee gets woken and we don't have its
bed-mates indefinitely trading wakeups among the wrong processes.
This is why we don't want this scenario happening often.

I think this could be made to work, but it would be a delicate and
hard-to-test change in what is already pretty subtle code.

A considerably more straightforward approach is just to forget about
incremental allocation of semaphores and grab all we could need at
postmaster startup. ("OK, Mac, you told me to allow up to N backends?
Fine, I'm going to grab N semaphores at startup, and if I can't get them
I won't play.") This would force the DB admin to either reconfigure the
kernel or reduce MaxBackendId to something the kernel can support right
off the bat, rather than allowing the problem to lurk undetected until
too many clients are started simultaneously. (Note there are still
potential gotchas with running out of processes, swap space, or file
table slots, so we wouldn't have really guaranteed that N backends can
be started safely.)

If we make MaxBackendId settable from a postmaster command-line switch
then this second approach is probably not too inconvenient, though it
surely isn't pretty.

Any thoughts about which way to jump? I'm sort of inclined to take
the simpler approach myself...

regards, tom lane
