Re: Dynamic Shared Memory stuff

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dynamic Shared Memory stuff
Date: 2013-11-20 19:58:24
Message-ID: CA+TgmoZOLrXTpi-10w=mqzFwhHe8=a=oGrScN5QyBxu+LarXNA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 20, 2013 at 8:32 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I'm trying to catch up on all of this dynamic shared memory stuff. A bunch
> of random questions and complaints:
>
> What kind of usage are we trying to cater for with the dynamic shared memory?

Parallel sort, and then parallel other stuff. Eventually general
parallel query.

I have recently updated https://wiki.postgresql.org/wiki/Parallel_Sort
and you may find that interesting/helpful as a statement of intent.

> How many allocations? What size will they typically have, minimum and
> maximum?

The facility is intended to be general, so the answer could vary
widely by application. The testing that I have done so far suggests
that for message-passing, relatively small queue sizes (a few kB,
perhaps 1 MB at the outside) should be sufficient. However,
applications such as parallel sort could require vast amounts of
shared memory. Consider a machine with 1TB of memory performing a
512GB internal sort. You're going to need 512GB of shared memory for
that.

> I looked at the message queue implementation you posted, but I
> wonder if that's the use case you're envisioning for this, or if you have
> more things in mind.

I consider that to be the first application of dynamic shared memory
and expect it to be used for (1) returning errors from background
workers to the user backend and (2) funneling tuples from one portion
of a query tree that has been split off to run in a background worker
back to the user backend. However, I expect that many clients of the
dynamic shared memory system will want to roll their own data
structures. Parallel internal sort (or external sort) is obviously
one, and in addition to that we might have parallel construction of
in-memory hash tables for a hash join or hash agg, or, well, anything
else you'd like to parallelize. I expect that many of those cases will
result in much larger allocations than what we need just for message
passing.

> * dsm_handle is defined in dsm_impl.h, but it's exposed in the function
> signatures in dsm.h. ISTM it should be moved to dsm.h

Well, dsm_impl.h is the low-level stuff, and dsm.h is intended as the
user API. Unfortunately, whichever file declares dsm_handle will have to be
included by the other one, so the separation is not as clean as I
would like, but I thought it made more sense for the high-level stuff
to depend on the low-level stuff rather than the other way around.

> * The DSM API contains functions for resizing the segment. That's not
> exercised by the MQ or TOC facilities. Is that going to stay dead code, or
> do you envision a user for it?

I dunno. It certainly seems like a useful thing to be able to do - if
we run out of memory, go get some more. It'd obviously be more useful
if we had a full-fledged allocator for dynamic shared memory, which is
something that I'd like to build but haven't built yet. However,
after discovering that it doesn't work either on Windows or with
System V shared memory, I'm less sanguine about the chances of finding
good uses for it. I haven't completely given up hope, but I don't
have anything concrete in mind at the moment. It'd be a little more
plausible if we adjusted things so that the mmap() implementation
works on Windows.

> * dsm_impl_can_resize() incorrectly returns false for DSM_IMPL_MMAP. The
> mmap() implementation can resize.

Oops, that's a bug.
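
A rough sketch of what the corrected check could look like. The enum and
function names below are hypothetical stand-ins, not the actual
dsm_impl.c code, and the POSIX entry is an assumption on my part; the
thread only states that resizing fails on Windows and System V and that
mmap() can resize.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the set of DSM implementations. */
typedef enum SketchDsmImpl
{
    SKETCH_IMPL_POSIX,
    SKETCH_IMPL_SYSV,
    SKETCH_IMPL_WINDOWS,
    SKETCH_IMPL_MMAP
} SketchDsmImpl;

/* Report whether a segment of the given implementation can be resized. */
bool
sketch_can_resize(SketchDsmImpl impl)
{
    switch (impl)
    {
        case SKETCH_IMPL_POSIX:
            return true;        /* assumption: resizable via ftruncate() */
        case SKETCH_IMPL_SYSV:
            return false;       /* per the thread: does not work */
        case SKETCH_IMPL_WINDOWS:
            return false;       /* per the thread: does not work */
        case SKETCH_IMPL_MMAP:
            return true;        /* the fix: previously reported false */
    }
    return false;               /* unknown implementation */
}
```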

> * This is an issue I've seen for some time with git master, while working on
> various things. Sometimes, when I kill the server with CTRL-C, I get this in
> the log:
>
> ^CLOG: received fast shutdown request
> LOG: aborting any active transactions
> FATAL: terminating connection due to administrator command
> LOG: autovacuum launcher shutting down
> LOG: shutting down
> LOG: database system is shut down
> LOG: could not remove shared memory segment "/PostgreSQL.1804289383":
> No such file or directory
>
> (that means ENOENT)
>
> And I just figured out why that happens: If you take a base backup of a
> running system, the pg_dynshmem/state file is included in the backup. If you
> now start up a standby from the backup on the same system, it will "clean
> up" and reuse the dynshmem segment still used by the master system. Now,
> when you shut down the master, you get that message in the log. If the
> segment was actually used for something, the master would naturally crash.

Ooh. Well, pg_basebackup can be fixed not to copy that, but there's
still going to be a problem with old-style base backups. We could try
to figure out some additional sanity check for the dsm code to use, to
determine whether or not it belongs to the same cluster, like storing
the port number or the system identifier or some other value in the
shared memory segment and then comparing it to verify whether we've
got the same one. Or perhaps we could store the PID of the creating
postmaster in there and check whether that PID is still alive,
although we might get confused if the PID has been recycled.
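
To make the two ideas concrete, here is a minimal sketch that combines
them. Everything in it is hypothetical: the header layout, the function
names, and the decision to require both checks are illustrative
assumptions, not what the dsm code does. The liveness probe uses
kill(pid, 0), which is subject to the PID-recycling caveat mentioned
above.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical identity stamp stored at the start of a control segment. */
typedef struct SketchSegmentHeader
{
    uint64_t    system_identifier;  /* cluster that created the segment */
    pid_t       creator_pid;        /* postmaster that created the segment */
} SketchSegmentHeader;

/* Probe for process existence without actually delivering a signal. */
bool
sketch_pid_is_alive(pid_t pid)
{
    if (pid <= 0)
        return false;
    if (kill(pid, 0) == 0)
        return true;
    /* EPERM means the process exists but belongs to another user. */
    return errno == EPERM;
}

/*
 * Decide whether a leftover segment may be cleaned up: it must carry our
 * cluster's identifier, and its creating postmaster must be gone.
 */
bool
sketch_safe_to_reclaim(const SketchSegmentHeader *hdr,
                       uint64_t my_system_identifier)
{
    if (hdr->system_identifier != my_system_identifier)
        return false;           /* some other cluster's segment */
    return !sketch_pid_is_alive(hdr->creator_pid);
}
```

A caller would read the header out of the existing segment at startup
and skip cleanup whenever sketch_safe_to_reclaim() returns false.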

> * As discussed in the "Something fishy happening on frogmouth" thread, I
> don't like the fact that the dynamic shared memory segments will be
> permanently leaked if you kill -9 postmaster and destroy the data directory.

Your test elicited different behavior for the dsm code vs. the main
shared memory segment because it involved running a new postmaster
with a different data directory but the same port number on the same
machine, and expecting that that new - and completely unrelated -
postmaster would clean up the resources left behind by the old,
now-destroyed cluster. I tend to view that as a defect in your test
case more than anything else, but as I suggested previously, we could
potentially change the code to use something like 1000000 + (port *
100) with a forward search for the control segment identifier, instead
of using a state file, mimicking the behavior of the main shared
memory segment. I'm not sure we ever reached consensus on whether
that was overall better than what we have now.
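
For illustration, the key scheme could be sketched as below. The formula
is the one given above; the function names, the in-use callback, and the
cap of 100 candidates per port (so the search never spills into the next
port's range) are assumptions made for this sketch only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback: returns true if the key is already taken. */
typedef bool (*sketch_key_in_use_fn) (uint32_t key, void *arg);

/* First candidate key for a port: 1000000 + (port * 100). */
uint32_t
sketch_base_key(uint16_t port)
{
    return 1000000u + (uint32_t) port * 100u;
}

/*
 * Search forward from the base key for a free identifier. Stores the
 * first free key in *result and returns true, or returns false if all
 * 100 candidates for this port are taken.
 */
bool
sketch_find_free_key(uint16_t port, sketch_key_in_use_fn in_use,
                     void *arg, uint32_t *result)
{
    uint32_t    base = sketch_base_key(port);

    for (uint32_t i = 0; i < 100; i++)
    {
        if (!in_use(base + i, arg))
        {
            *result = base + i;
            return true;
        }
    }
    return false;
}

/* Demo predicate: pretends every key below *(uint32_t *) arg is taken. */
bool
sketch_keys_below_limit_in_use(uint32_t key, void *arg)
{
    return key < *(uint32_t *) arg;
}
```

With port 5432 the base key is 1543200; if the demo predicate reports
everything below 1543203 as taken, the search settles on 1543203.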

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
