Re: Dynamic Shared Memory stuff

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dynamic Shared Memory stuff
Date: 2013-12-05 16:12:48
Message-ID: 52A0A600.5080805@vmware.com
Lists: pgsql-hackers

On 11/20/2013 09:58 PM, Robert Haas wrote:
> On Wed, Nov 20, 2013 at 8:32 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> How many allocations? What size will they have typically, minimum and
>> maximum?
>
> The facility is intended to be general, so the answer could vary
> widely by application. The testing that I have done so far suggests
> that for message-passing, relatively small queue sizes (a few kB,
> perhaps 1 MB at the outside) should be sufficient. However,
> applications such as parallel sort could require vast amounts of
> shared memory. Consider a machine with 1TB of memory performing a
> 512GB internal sort. You're going to need 512GB of shared memory for
> that.

Hmm. Those two use cases are quite different. For message-passing, you
want a lot of small queues, but for parallel sort, you want one huge
allocation. I wonder if we should even try a one-size-fits-all solution.

For message-passing, there isn't much need to even use dynamic shared
memory. You could just assign one fixed-size, single-reader,
multiple-writer queue to each backend.
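
Something like this, layout-wise (just a sketch; the names and sizes are
made up):

#include <stdint.h>

#define PER_BACKEND_QUEUE_SIZE 8192    /* a few kB, per your numbers above */

/*
 * One of these per backend, carved out of the main shared memory segment
 * at postmaster startup -- no dynamic segments needed at all.  Backend i
 * reads only from queue i; any backend may write to it, serialized by a
 * spinlock (or LWLock) stored alongside the offsets.
 */
typedef struct BackendMsgQueue
{
    uint32_t    read_off;       /* advanced only by the owning backend */
    uint32_t    write_off;      /* advanced by writers, under the lock */
    char        data[PER_BACKEND_QUEUE_SIZE];
} BackendMsgQueue;

/* At startup: BackendMsgQueue queues[MaxBackends] in the main segment. */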

For parallel sort, you'll want to utilize all the available memory and
all CPUs for one huge sort. So all you really need is a single huge
shared memory segment. If one process is already using that 512GB
segment to do a sort, you do *not* want to allocate a second 512GB
segment. You'll want to wait for the first operation to finish instead.
Or maybe you'll want to allow 3-4 somewhat smaller segments in use at
the same time, but not more than that.

>> * As discussed in the "Something fishy happening on frogmouth" thread, I
>> don't like the fact that the dynamic shared memory segments will be
>> permanently leaked if you kill -9 postmaster and destroy the data directory.
>
> Your test elicited different behavior for the dsm code vs. the main
> shared memory segment because it involved running a new postmaster
> with a different data directory but the same port number on the same
> machine, and expecting that that new - and completely unrelated -
> postmaster would clean up the resources left behind by the old,
> now-destroyed cluster. I tend to view that as a defect in your test
> case more than anything else, but as I suggested previously, we could
> potentially change the code to use something like 1000000 + (port *
> 100) with a forward search for the control segment identifier, instead
> of using a state file, mimicking the behavior of the main shared
> memory segment. I'm not sure we ever reached consensus on whether
> that was overall better than what we have now.

I really think we need to do something about it. To use your earlier
example of parallel sort, it's not acceptable to permanently leak a 512
GB segment on a system with 1 TB of RAM.

One idea is to create the shared memory object with shm_open(), and wait
until all the worker processes that need it have attached to it. Then
shm_unlink() it, before using it for anything. That way the segment will
be automatically released once all the processes close() it, or die. In
particular, kill -9 will release it. (This is a variant of my earlier
idea to create a small number of anonymous shared memory file
descriptors at postmaster startup with shm_open(), and pass them down to
child processes with fork().) I think you could use the same approach
with SysV shared memory as well, by destroying the segment with
shmctl(IPC_RMID) immediately after all processes have attached to it.
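
Roughly like this (untested sketch; error handling and the actual
wait-for-workers synchronization are omitted, and the function name is
invented):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *
create_and_unlink_segment(const char *name, size_t size)
{
    int     fd;
    void   *addr;

    /* Create and size the segment, then map it. */
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    ftruncate(fd, size);
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* ... wait here until all workers have shm_open()'d and mmap()'d it ... */

    /*
     * Remove the name.  The memory lives on as long as at least one
     * process still has it mapped; once the last one exits -- even via
     * kill -9 -- the kernel releases it.  The SysV equivalent would be
     * shmctl(shmid, IPC_RMID, NULL) after everyone has shmat()'d.
     */
    shm_unlink(name);
    close(fd);

    return addr;
}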

- Heikki
