From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: shared memory message queues
Date: 2013-10-31 16:21:31
Message-ID: CA+TgmobUe28JR3zRUDH7s0jkCcdxsw6dP4sLw57x9NnMf01wgg@mail.gmail.com
Lists: pgsql-hackers

Right now, it's pretty hard to write code that does anything useful
with dynamic shared memory. Sure, you can allocate a dynamic shared
memory segment, and that's nice, but you won't get any help at all
figuring out what to store in it, or how to use it to communicate
effectively, which is not so nice. And some of the services we offer
around the main shared memory segment are conspicuously missing for
dynamic shared memory. The attached patches attempt to rectify some
of these problems. If you're not the patient type who wants to read
the whole email, patch #3 is the cool part.

Patch #1, on-dsm-detach-v1.patch, adds the concept of on_dsm_detach
hooks. These are basically like on_shmem_exit hooks, except that
detaching from a dsm can happen at any time, not just at backend exit.
But they're needed for the same reasons: when we detach from the main
shared memory segment, we need to make sure that we've released all
relevant locks, returned our PGPROC to the pool, etc. Dynamic shared
memory segments require the same sorts of cleanup when they contain
similarly complex data structures. The part of this patch which I
suppose will elicit some controversy is that I've had to rearrange
on_shmem_exit a bit. It turns out that during shmem_exit, we do
"user-level" cleanup, like aborting the transaction, first. We expect
that will probably release all of our shared-memory resources. Then,
just to make doubly sure, we do "low-level cleanup", where individual
modules return session-lifetime resources and make doubly sure that no
lwlocks, etc. have been leaked. on_dsm_exit callbacks properly happen
in the middle, after we've tried to abort the transaction but before
the main shared memory segment is finally shut down. I'm not sure
that the solution I've adopted here is optimal; see within for
details.
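
To make the intended usage a bit more concrete, here is a minimal sketch of
how a module might use such a hook. Everything named My*/my_* here is
hypothetical, the refcount is just a stand-in for whatever per-backend state
needs releasing, and the callback signature is an assumption based on the
description above, so the real patch may differ in detail:

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/spin.h"

/* Hypothetical per-segment state; not part of the patch itself. */
typedef struct MySharedState
{
    slock_t     mutex;
    int         refcount;
} MySharedState;

/* Runs whenever this backend detaches from the segment. */
static void
my_on_dsm_detach(dsm_segment *seg, Datum arg)
{
    MySharedState *state = (MySharedState *) DatumGetPointer(arg);

    SpinLockAcquire(&state->mutex);
    state->refcount--;
    SpinLockRelease(&state->mutex);
}

/* Attach to an existing segment and register cleanup. */
static dsm_segment *
my_attach(dsm_handle handle)
{
    dsm_segment *seg = dsm_attach(handle);
    MySharedState *state = (MySharedState *) dsm_segment_address(seg);

    SpinLockAcquire(&state->mutex);
    state->refcount++;
    SpinLockRelease(&state->mutex);

    /*
     * Fires on explicit dsm_detach() and also at backend exit, after we
     * have tried to abort the transaction but before the main shared
     * memory segment's low-level cleanup.
     */
    on_dsm_detach(seg, my_on_dsm_detach, PointerGetDatum(state));

    return seg;
}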

Patch #2, shm-toc-v1.patch, provides a facility for sizing a dynamic
shared memory segment before creation, and for dividing it up into
chunks after it's been created. It therefore serves a function quite
similar to RequestAddinShmemSpace, except of course that there is only
one main shared memory segment created at postmaster startup time,
whereas new dynamic shared memory segments can come into existence on
the fly; and it serves even more conspicuously the function of
ShmemIndex, which enables backends to locate particular data
structures within the shared memory segment. It is however quite a
bit simpler than the ShmemIndex mechanism: we don't need the same
level of extensibility here that we do for the main shared memory
segment, because a new extension need not piggyback on an existing
dynamic shared memory segment, but can create a whole segment of its
own.
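
As a rough sketch of the intended workflow (the shm_toc_* names follow the
facility as described above, but the exact interface in the v1 patch may
differ; MY_MAGIC and the key numbers are made up for illustration):

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/shm_toc.h"

#define MY_MAGIC        0x79fb2447      /* arbitrary module identifier */
#define MY_KEY_STATE    0
#define MY_KEY_TUPLES   1

static dsm_segment *
my_setup(Size state_size, Size tuple_area_size)
{
    shm_toc_estimator e;
    Size        segsize;
    dsm_segment *seg;
    shm_toc    *toc;
    void       *state;
    void       *tuples;

    /* Step 1: size the segment before it exists. */
    shm_toc_initialize_estimator(&e);
    shm_toc_estimate_chunk(&e, state_size);
    shm_toc_estimate_chunk(&e, tuple_area_size);
    shm_toc_estimate_keys(&e, 2);
    segsize = shm_toc_estimate(&e);

    /* Step 2: create the segment and carve it into keyed chunks. */
    seg = dsm_create(segsize);
    toc = shm_toc_create(MY_MAGIC, dsm_segment_address(seg), segsize);

    state = shm_toc_allocate(toc, state_size);
    shm_toc_insert(toc, MY_KEY_STATE, state);

    tuples = shm_toc_allocate(toc, tuple_area_size);
    shm_toc_insert(toc, MY_KEY_TUPLES, tuples);

    return seg;
}

A cooperating backend would then call shm_toc_attach() with the same magic
number on its mapping of the segment and use shm_toc_lookup() with the
agreed-upon keys to find the chunks, playing the role that ShmemIndex plays
for the main segment.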

Patch #3, shm-mq-v1.patch, is the heart of this series. It creates an
infrastructure for sending and receiving messages of arbitrary length
using ring buffers stored in shared memory (presumably dynamic shared
memory, but hypothetically the main shared memory segment could be
used). Queues are single-reader and single-writer; they use process
latches to implement waiting for the queue to fill (in the case of the
reader) or drain (in the case of the writer). A non-blocking mode is
also available for situations where other options might lead to
deadlock. Even without this patch, backends can write messages to a
dynamic shared memory segment and wait for some other backend to read
them, but unless you know exactly how much data you want to send
before you create the shared memory segment, and are willing to store
all of it for the lifetime of the segment, you'll quickly run
into non-trivial problems around memory reuse and synchronization. So
this is an effort to create a higher-level infrastructure where one
process can simply declare that it wishes to send a series of messages
to a particular queue and another process can declare that it wishes
to read them out of that queue, and so it happens.
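
To illustrate the shape of the interface (the function names and signatures
here are my best reading of the facility as described; details in the v1
patch may differ, and my_send/my_receive are invented wrappers):

#include "postgres.h"

#include "storage/dsm.h"
#include "storage/proc.h"
#include "storage/shm_mq.h"

/*
 * Sender side: lay a queue down in some chunk of a dsm segment (located,
 * say, via the table-of-contents facility from patch #2) and push one
 * message into it.
 */
static void
my_send(dsm_segment *seg, void *queue_space, Size queue_size, const char *msg)
{
    shm_mq     *mq = shm_mq_create(queue_space, queue_size);
    shm_mq_handle *mqh;
    shm_mq_result res;

    shm_mq_set_sender(mq, MyProc);
    mqh = shm_mq_attach(mq, seg, NULL);

    /* Blocking send: waits on our latch until the reader makes room. */
    res = shm_mq_send(mqh, strlen(msg) + 1, msg, false);
    if (res == SHM_MQ_DETACHED)
        ereport(ERROR,
                (errmsg("receiver detached before the message was sent")));
}

/*
 * Receiver side, in some other backend attached to the same segment and
 * holding a pointer to the same queue.
 */
static void
my_receive(dsm_segment *seg, shm_mq *mq)
{
    shm_mq_handle *mqh;
    shm_mq_result res;
    Size        nbytes;
    void       *data;

    shm_mq_set_receiver(mq, MyProc);
    mqh = shm_mq_attach(mq, seg, NULL);

    /* Blocking receive: waits until the writer sends something. */
    res = shm_mq_receive(mqh, &nbytes, &data, false);
    if (res == SHM_MQ_SUCCESS)
        elog(LOG, "got %lu bytes: %s", (unsigned long) nbytes, (char *) data);
}

Passing true for the last argument of shm_mq_send()/shm_mq_receive() gives
the non-blocking behavior mentioned above, returning SHM_MQ_WOULD_BLOCK
instead of waiting.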

As far as parallelism is concerned, I anticipate that this code will
be useful for at least two purposes: (1) propagating errors that occur
inside a worker process back to the user backend that initiated the
parallel operation; and (2) streaming tuples from a worker performing
one part of the query (a scan or join, say) back to the user backend
or another worker performing a different part of the same query. I
suspect that this code will find applications outside parallelism as
well.

Patch #4, test-shm-mq-v1.patch, is a demonstration of how to use the
various background worker and dynamic shared memory facilities
introduced over the course of the 9.4 release cycle, and the
facilities introduced by patches #1-#3 of this series, to actually do
something interesting. Specifically, it sets up a ring of processes
connected by shared message queues and relays a user-specified message
around the ring repeatedly, then checks that it has the same message
at the end. This is obviously just a demonstration, but I find it
pretty cool, because the code here demonstrates that, with all of
these facilities in place, setting up a bunch of workers and having
them talk to each other can be done using what is really a pretty
modest amount of code. Importantly, this patch shows how to make the
start-up and shut-down sequences reliable, so that you don't end up
with the user backend hanging forever waiting for a worker that has
already died or will never start, or a worker backend waiting for a
user backend that has already aborted. Review of this logic is
particularly appreciated, as it's proven to be pretty complex: I think
the solutions I've worked out here are generally good, but there may
still be holes to plug. My hope is that people will take this test
code and use it as a basis for real applications. Including this
patch in our distribution will also serve as a useful regression test
of dynamic background workers and dynamic shared memory, which has so
far been lacking.
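
One small but representative piece of that reliability puzzle: a backend
reading from a queue has to treat "the other side went away" as a
terminating condition rather than something to wait out. A hedged sketch of
that pattern (my_receive_one is an invented helper, not code from the test
module):

#include "postgres.h"

#include "storage/shm_mq.h"

static void
my_receive_one(shm_mq_handle *mqh)
{
    Size        nbytes;
    void       *data;
    shm_mq_result res;

    res = shm_mq_receive(mqh, &nbytes, &data, false);
    if (res == SHM_MQ_DETACHED)
        ereport(ERROR,
                (errmsg("lost connection to worker process")));
    Assert(res == SHM_MQ_SUCCESS);

    /* ... process the nbytes-byte message pointed to by data ... */
}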

Particular thanks are due to Noah Misch for serving as my constant
sounding board during the development of this patch series.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
on-dsm-detach-v1.patch text/x-patch 27.2 KB
shm-toc-v1.patch text/x-patch 9.6 KB
shm-mq-v1.patch text/x-patch 27.4 KB
test-shm-mq-v1.patch text/x-patch 30.9 KB
