dynamic shared memory

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: dynamic shared memory
Date: 2013-08-14 01:09:06
Message-ID: CA+TgmoaDqDUgt=4Zs_QPOnBt=EstEaVNP+5t+m=FPNWshiPR3A@mail.gmail.com
Lists: pgsql-hackers

Please find attached a first version of a patch to allow additional
"dynamic" shared memory segments; that is, shared memory segments that
are created after server startup, live for a period of time, and are
then destroyed when no longer needed. The main purpose of this patch
is to facilitate parallel query: if we've got multiple backends
working on the same query, they're going to need a way to communicate.
Doing that through the main shared memory segment seems infeasible
because we could, for some applications, need to share very large
amounts of data. For example, for internal sort, we basically load
the data to be sorted into memory and then rearrange an array of
pointers to the items being sorted. For parallel internal sort, we
might want to do much the same thing, but with different backend
processes manipulating different parts of the array. I'm not exactly
sure how that's going to work out yet in detail, but it seems fair to
say that the amount of data we want to share between processes there
could be quite a bit larger than anything we'd feel comfortable
nailing down in the permanent shared memory segment. Other cases,
like parallel sequential scan, might require much smaller buffers,
since there might not be much point in letting the scan get too far
ahead if nothing's consuming the tuples it produces. With this
infrastructure, we can choose at run-time exactly how much memory to
allocate for a particular purpose and return it to the operating
system as soon as we're done with it.

Creating a shared memory segment is a somewhat operating-system
dependent task. I decided that it would be smart to support several
different implementations and to let the user choose which one they'd
like to use via a new GUC, dynamic_shared_memory_type. Since we
currently require System V shared memory to be supported on all
platforms other than Windows, I have included a System V
implementation (shmget, shmctl, shmat, shmdt). However, as we know,
on many systems, System V shared memory limits are often low out of
the box and raising them is an annoyance for users. Therefore, I've
included an implementation based on POSIX shared memory facilities
(shm_open, shm_unlink), which is the default on systems where those
facilities are supported (some of the BSDs do not support them, I
believe). We
will also need a Windows implementation, which I have not attempted,
but one of my colleagues at EnterpriseDB will be filling in that gap.
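For anyone who wants a mental model of what the POSIX implementation
has to do at the operating-system level, the underlying primitives
amount to roughly the following. This is a sketch for illustration
only; the naming, structure, and error handling are not lifted from
the patch itself:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sketch only: create and map a named POSIX shared memory segment. */
static void *
create_posix_segment(const char *name, size_t size)
{
    int   fd;
    void *addr;

    /* Create the named segment; it persists until shm_unlink(name). */
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    /* Give it a size, then map it into this process's address space. */
    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        shm_unlink(name);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping remains valid after close */
    return (addr == MAP_FAILED) ? NULL : addr;
}

A second backend attaches by calling shm_open() on the same name
(without O_CREAT | O_EXCL) and mmap()ing the result, and the segment
disappears once it has been shm_unlink()ed and the last mapping is
gone.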

In addition, I've included an implementation based on mmap of a plain
file. As compared with a true shared memory implementation, this
obviously has the disadvantage that the OS may be more likely to
decide to write back dirty pages to disk, which could hurt
performance. However, I believe it's worthy of inclusion all the
same, because there are a variety of situations in which it might be
more convenient than one of the other implementations. One is
debugging. On MacOS X, for example, there seems to be no way to list
POSIX shared memory segments, and no easy way to inspect the contents
of either POSIX or System V shared memory segments. Another use case
is working around an administrator-imposed or OS-imposed shared memory
limit. If you're not allowed to allocate shared memory, but you are
allowed to create files, then this implementation will let you use
whatever facilities we build on top of dynamic shared memory anyway.
A third possible reason to use this implementation is
compartmentalization. For example, you can put the directory that
stores the dynamic shared memory segments on a RAM disk - which
removes the performance concern - and then do whatever you like with
that directory: secure it, put filesystem quotas on it, or sprinkle
magic pixie dust on it. It doesn't even seem out of the question that
there might be cases where there are multiple RAM disks present with
different performance characteristics (e.g. on NUMA machines) and this
would provide fine-grained control over where your shared memory
segments get placed. To make a long story short, I won't be crushed
if the consensus is against including this, but I think it's useful.
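The only moving part that really changes for the plain-file
implementation is where the file descriptor comes from. A minimal
sketch, using an illustrative path rather than the patch's actual
naming:

/* Plain-file variant: only the source of the file descriptor changes.
 * The "pg_dynshmem/1234" path here is illustrative, not the patch's
 * naming scheme. */
int fd = open("pg_dynshmem/1234", O_CREAT | O_EXCL | O_RDWR, 0600);

/* ftruncate() and mmap(..., MAP_SHARED, fd, 0) then proceed exactly as
 * in the shm_open() sketch above; cleanup means unlink()ing the file,
 * and leftover segments can be found by scanning the directory at
 * startup. */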

Other implementations are imaginable but not implemented here. For
example, you can imagine using the mmap() of an anonymous file.
However, since the point is that these segments are created on the fly
by individual backends and then shared with other backends, that gets
a little tricky. In order for the second backend to map the same
anonymous shared memory segment that the first one mapped, you'd have
to pass the file descriptor from one process to the other. There are
ways, on most if not all platforms, to pass file descriptors through
sockets, but there's not automatically a socket connection between the
two processes either, so it gets hairy to think about making this
work. I did, however, include a "none" implementation which has the
effect of shutting the facility off altogether.
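To give a flavor of why descriptor-passing is unappetizing here,
shipping even one file descriptor across a Unix-domain socket takes
roughly the following. None of this is in the patch; it only
illustrates the machinery involved:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch only: send an open file descriptor to another process over an
 * already-connected Unix-domain socket using SCM_RIGHTS. */
static int
send_fd(int sock, int fd)
{
    struct msghdr   msg;
    struct iovec    iov;
    struct cmsghdr *cmsg;
    char            dummy = 'x';
    union
    {
        struct cmsghdr  align;
        char            buf[CMSG_SPACE(sizeof(int))];
    } control;

    memset(&msg, 0, sizeof(msg));

    /* At least one byte of ordinary data must accompany the descriptor. */
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) < 0) ? -1 : 0;
}

The receiving side then has to do the mirror-image recvmsg() dance,
which means you need a socket connection between every pair of
processes that might want to share a segment.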

The actual implementation is split up into two layers. dsm_impl.c/h
encapsulate the implementation-dependent functionality at a very raw
level, while dsm.c/h wrap that functionality in a more palatable API.
Most of that wrapper layer is concerned with just one problem:
avoiding leaks. This turned out to require multiple levels of
safeguards, which I duly implemented. First, dynamic shared memory
segments need to be reference-counted, so that when the last mapping
is removed, the segment automatically goes away (we could allow for
server-lifespan segments as well with only trivial changes, but I'm
not sure whether there are compelling use cases for that). If a
backend is terminated uncleanly, the postmaster needs to remove all
leftover segments during the crash-and-restart process, just as it
needs to reinitialize the main shared memory segment. And if all
processes are terminated uncleanly, the next postmaster startup needs
to clean up any segments that still exist, again just as we already do
for the main shared memory segment. Neither POSIX shared memory nor
System V shared memory provides an API for enumerating all existing
shared memory segments, so we must keep track ourselves of what we
have outstanding. Second, we need to ensure, within the scope of an
individual process, that we only retain a mapping for as long as
necessary. Just as memory contexts, locks, buffer pins, and other
resources automatically go away at the end of a query or
(sub)transaction, dynamic shared memory mappings created for a purpose
such as parallel sort need to go away if we abort mid-way through. Of
course, if you have a user backend coordinating with workers, it seems
pretty likely that the workers are just going to exit if they hit an
error, so having the mapping be process-lifetime wouldn't necessarily
be a big deal; but the user backend may stick around for a long time
and execute other queries, and we can't afford to have it accumulate
mappings, not least because that's equivalent to a session-lifespan
memory leak.
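
To give a rough idea of the intended usage pattern (the function names
below are simplified illustrations, not exact quotations from the
patch), a backend that wants to share memory with cooperating
processes would do something like this:

#include "postgres.h"
#include "storage/dsm.h"        /* illustrative header name */

/* Sketch only: create a segment, advertise its handle, and rely on
 * resource owner cleanup to detach it if we abort mid-query. */
static void
share_some_memory(Size request_size)
{
    dsm_segment *seg = dsm_create(request_size);    /* refcount becomes 1 */
    dsm_handle   handle = dsm_segment_handle(seg);  /* small id another
                                                     * backend can use */
    char        *base = dsm_segment_address(seg);

    /* ... publish "handle" somewhere the workers can see it, and fill
     * "base" with whatever the cooperating processes have agreed on ... */
    (void) base;
    (void) handle;

    /* A worker does dsm_attach(handle), uses dsm_segment_address() on
     * the result, and calls dsm_detach() when done; the segment itself
     * is destroyed when the last mapping goes away. */
}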

To help solve these problems, I invented something called the "dynamic
shared memory control segment". This is a dynamic shared memory
segment created at startup (or reinitialization) time by the
postmaster before any user processes are created. It is used to store a
list of the identities of all the other dynamic shared memory segments
we have outstanding and the reference count of each. If the
postmaster goes through a crash-and-reset cycle, it scans the control
segment and removes all the other segments mentioned there, and then
recreates the control segment itself. If the postmaster is killed off
(e.g. kill -9) and restarted, it locates the old control segment and
proceeds similarly. If the whole operating system is rebooted, the
old control segment won't exist any more, but that's OK, because none
of the other segments will either - except under the
mmap-a-regular-file implementation, which handles cleanup by scanning
the relevant directory rather than relying on the control segment.
These precautions seem sufficient to ensure that dynamic shared memory
segments can't survive the postmaster itself short of a hard kill, and
that even after a hard kill we'll clean things up on a subsequent
postmaster startup. The other problem, of making sure that segments
get unmapped at the proper time, is solved using the resource owner
mechanism. There is an API to create a mapping which is
session-lifespan rather than resource-owner lifespan, but the default
is resource-owner lifespan, which I suspect will be right for common
uses. Thus, there are four separate occasions on which we remove
shared memory segments: (1) resource owner cleanup, (2) backend exit
(for any session-lifespan mappings and anything else that slips
through the cracks), (3) postmaster exit (in case a child dies without
cleaning itself up), and (4) postmaster startup (in case the
postmaster dies without cleaning up).
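
For concreteness, the control segment itself is conceptually just a
small header plus an array of slots, something along these lines
(simplified and illustrative, not the exact layout from the patch):

/* Sketch only: conceptual layout of the control segment. */
typedef uint32 dsm_handle;              /* identity of one dynamic segment */

typedef struct dsm_control_item
{
    dsm_handle  handle;                 /* which segment this slot tracks */
    uint32      refcnt;                 /* 0 means the slot is free */
} dsm_control_item;

typedef struct dsm_control_header
{
    uint32      magic;                  /* sanity check for a stale segment */
    uint32      nitems;                 /* number of slots currently in use */
    uint32      maxitems;               /* sized based on MaxBackends */
    dsm_control_item item[FLEXIBLE_ARRAY_MEMBER];
} dsm_control_header;

Creating a segment claims a slot and sets its reference count,
attaching and detaching adjust it, and both crash-restart cleanup and
postmaster-startup cleanup just walk the array and remove whatever is
still listed there.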

There are quite a few problems that this patch does not solve. First,
while it does give you a shared memory segment, it doesn't provide you
with any help at all in figuring out what to put in that segment. The
task of figuring out how to communicate usefully through shared memory
is thus, for the moment, left entirely to the application programmer.
While there may be cases where that's just right, I suspect there will
be a wider range of cases where it isn't, and I plan to work on some
additional facilities, sitting on top of this basic structure, next,
though probably as a separate patch. Second, it doesn't make any
policy decisions about what is sensible either in terms of number of
shared memory segments or the sizes of those segments, even though
there are serious practical limits in both cases. Actually, the total
number of segments system-wide is limited by the size of the control
segment, which is sized based on MaxBackends. But there's nothing to
keep a single backend from eating up all the slots, even though that's
both pretty unfriendly and unportable, and there's no real limit to
the amount of memory it can gobble up per slot, either. In other
words, it would be a bad idea to write a contrib module that exposes a
relatively uncooked version of this layer to the user.

But, just for testing purposes, I did just that. The attached patch
includes contrib/dsm_demo, which lets you say
dsm_demo_create('something') in one session, and if you pass the return
value to dsm_demo_read() in the same or another session during the
lifetime of the first session, you'll read back the same value you
saved. This is not, by any stretch of the imagination, a
demonstration of the right way to use this facility - but as a crude
unit test, it suffices. Although I'm including it in the patch file,
I would anticipate removing it before commit. Hopefully, with a
little more functionality on top of what's included here, we'll soon
be in a position to build something that might actually be useful to
someone, but this layer itself is a bit too impoverished to build
something really cool, at least not without more work than I wanted to
put in as part of the development of this patch.

Using that crappy contrib module, I verified that the POSIX, System V,
and mmap implementations all work on my MacBook Pro (OS X 10.8.4) and
on Linux (Fedora 16). I wouldn't like to have to wager on having
gotten all of the details right to be absolutely portable everywhere,
so I wouldn't be surprised to see this break on other systems.
Hopefully that will be a matter of adjusting the configure tests a bit
rather than coping with substantive changes in available
functionality, but we'll see.

Finally, I'd like to thank Noah Misch for a lot of discussion and
thought that enabled me to make this patch much better than it
otherwise would have been. Although I didn't adopt Noah's preferred
solutions to all of the problems, and although there are probably
still some problems buried here, there would have been more if not for
his advice. I'd also like to thank the entire database server team at
EnterpriseDB for allowing me to dump large piles of work on them so
that I could work on this, and my boss, Tom Kincaid, for not allowing
other people to dump large piles of work on me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
dynshmem-v1.patch application/octet-stream 78.3 KB
