Re: bg worker: patch 1 of 6 - permanent process

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date: 2010-08-27 20:46:30
Message-ID: AANLkTinReOerwuGyQMe=3qKGpvT1PBLMK10ZuM_BsqDM@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 27, 2010 at 2:17 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
>> In addition, it means that maximum_message_queue_size_per_backend (or
>> whatever it's called) can be changed on-the-fly; that is, it can be
>> PGC_SIGHUP rather than PGC_POSTMASTER.
>
> That's certainly a point. However, as you are proposing a solution to just
> one subsystem (i.e. imessages), I don't find it half as convincing.

What other subsystems are you imagining servicing with a dynamic
allocator? If there were a big demand for this functionality, we
probably would have been forced to implement it already, but that's
not the case. We've already discussed the fact that there are massive
problems with using it for something like shared_buffers, which is by
far the largest consumer of shared memory.

> If you are saying it *should* be possible to resize shared memory in a
> portable way, why not do it for *all* subsystems right away? I still
> remember Tom saying it's not something that's doable in a portable way.

I think it would be great if we could bring some more flexibility to
our memory management. There are really two layers of problems here.
One is resizing the segment itself, and one is resizing structures
within the segment. As far as I can tell, there is no portable API
that can be used to resize the shm itself. For so long as that
remains the case, I am of the opinion that any meaningful resizing of
the objects within the shm is basically unworkable. So we need to
solve that problem first.

There are a couple of possible solutions, which have been discussed
here in the past. One very appealing option is to use POSIX shm
rather than sysv shm. AFAICT, it is possible to portably resize a
POSIX shm using ftruncate(), though I am not sure to what extent this
is supported on Windows. One significant advantage of using POSIX shm
is that the default limits for POSIX shm on many operating systems are
much higher than the corresponding limits for sysv shm; in fact, some
people have expressed the opinion that it might be worth making the
switch for that reason alone, since it is no secret that a default
value of 32MB or less for shared_buffers is not enough to get
reasonable performance on many modern systems. I believe, however,
that Tom Lane thinks we need to get a bit more out of it than that to
make it worthwhile. One obstacle to making the switch is that POSIX
shm does not provide a way to fetch the number of processes attached
to the shared memory segment, which is a critical part of our
infrastructure to prevent accidentally running multiple postmasters on
the same data directory at the same time. Consequently, it seems hard
to see how we can make that switch completely. At a minimum, we'll
probably need to maintain a small sysv shm for interlock purposes.

OK, so let's suppose we use POSIX shm for most of the shared memory
segment, and keep only our fixed-size data structures in the sysv shm.
Then what? Well, then we can potentially resize it. Because we are
using a process-based model, this will require some careful
gymnastics. Let's say we're growing the shm. The backend that is
initiating the operation will call ftruncate() and then signal all
of the other backends (using a sinval message or a multiplexed
signal or some such mechanism) to unmap and remap the shared memory
segment. Any failure to remap the shared memory segment is at least a
FATAL for that backend, and very likely a PANIC, so this had better
not be something we plan to do routinely - for example, we wouldn't
want to do this as a way of adapting to changing load conditions. It
would probably be acceptable to do it in a situation such as a
postgresql.conf reload, to accommodate a change in the server
parameter that can't otherwise be changed without a restart, since the
worst case scenario is, well, we have to restart anyway. Once all
that's done, it's safe to start allocating memory from the newly added
portion of the shm. Conversely, if we want to shrink the shm, the
considerations are similar, but we have to do everything in the
opposite order. First, we must ensure that the portion of the shm
we're about to release is unused. Then, we tell all the backends to
unmap and remap it. Once we've confirmed that they have done so, we
ftruncate() it to the new size.

Next, we have to think about how we're going to resize data structures
within this expandable shm. Many of these structures are not things
that we can easily move without bringing the system to a halt. For
example, it's difficult to see how you could change the base address
of shared buffers without ceasing all system activity, at which point
there's not really much advantage over just forcing a restart.
Similarly with LWLocks or the ProcArray. And if you can't move them,
then how will you grow them if (as will likely be the case) there's
something immediately following them in memory? One possible solution
is to divide up these data structures into "slabs". For example, we
might imagine allocating shared_buffers in 1GB chunks. To make this
work, we'd need to change the memory layout so that each chunk would
include all of the miscellaneous stuff that we need to do bookkeeping
for that chunk, such as the LWLocks and buffer descriptors. That
doesn't seem completely impossible, but there would be some
performance penalty, because you could no longer index into shared
buffers from a single base offset. Instead, you'd need to determine
which chunk contains the buffer you want, look up the base address for
that chunk, and then index into the chunk. Maybe that overhead
wouldn't be significant (or maybe it would); at any rate, it's not
completely free. There's also the problem of handling the partial
chunk at the end, especially if that happens to be the only chunk.

I think the problems for other arrays are similar, or more severe. I
can't see, for example, how you could resize the ProcArray using this
approach. If you want to deallocate a chunk of shared buffers, it's
not impossible to imagine an algorithm for relocating any dirty
buffers in the segment to be deallocated into the remaining available
space, and then chucking the ones that are not dirty. It might not be
real cheap, but that's not the same thing as not possible. On the
other hand, changing the backend ID of a process in flight seems
intractable. Maybe it's not. Or maybe there is some other approach
to resizing these data structures that can work, but it's not real
clear to me what it is.

So basically my feeling is that reworking our memory allocation in
general, while possibly worthwhile, is a whole lot of work. If we
focus on getting imessages done in the most direct fashion possible,
it seems like the sort of things that could get done in six months to
a year. If we take the approach of reworking our whole approach to
memory allocation first, I think it will take several years. Assuming
the problems discussed above aren't totally intractable, I'd be in
favor of solving them, because I think we can get some collateral
benefits out of it that would be nice to have. However, it's
definitely a much larger project.

> Why
> and how should it be possible for a per-backend basis?

If we're designing a completely new subsystem, we have a lot more
design flexibility, because we needn't worry about interactions with
the existing users of shared memory. Resizing an arena that is only
used for imessages is a lot more straightforward than resizing the
main shared memory arena. If you can't remap the main shared memory
chunk, you won't be able to properly clean up your state while
exiting, and so a PANIC is forced. But if you can't remap the
imessages chunk, and particularly if it only contains messages that
were addressed to you, then you should be able to get by with FATAL,
which is certainly a good thing from a system robustness point of
view. And you might not even need to remap it. The main reason
(although perhaps not the only reason) that someone would likely want
to vary a global allocation for parallel query or replication is if
they changed from "not using that feature" to "using it", or perhaps
from "using it" to "using it more heavily". If the allocations are
per-backend and can be made on the fly, that problem goes away.

As long as we keep the shared memory area used for imessages/dynamic
allocation separate from, and independent of, the main shm, we can
still gain many of the same advantages - in particular, not PANICing
if a remap fails, and being able to resize the thing on the fly.
However, I believe that the implementation will be more complex if the
area is not per-backend. Resizing is almost certainly a necessity in
this case, for the reasons discussed above, and that will have to be
done by having all backends unmap and remap the area in a coordinated
fashion, so it will be more disruptive than unmapping and remapping a
message queue for a single backend, where you only need to worry about
the readers and writers for that particular queue. Also, you now have
to worry about fragmentation: a simple ring buffer is great if you're
processing messages on a FIFO basis, but when you have multiple
streams of messages with different destinations, it's probably not a
great solution.

> How portable is
> mmap() really? Why don't we use it in Postgres as of now?

I believe that mmap() is very portable, though there are other people
on this list who know more about exotic, crufty platforms than I do.
I discussed the question of why it's not used for our current shared
memory segment above - no nattch interlock.

>> As to efficiency, the process is not much different once the initial
>> setup is completed.
>
> I fully agree to that.
>
> I'm more concerned about ease of use for developers. Simply being able to
> alloc() from shared memory makes things easier than having to invent a
> separate allocation method for every subsystem, again and again (the
> argument that people are more used to multi-threaded programming).

This goes back to my points further up: what else do you think this
could be used for? I'm much less optimistic about this being reusable
than you are, and I'd like to hear some concrete examples of other use
cases.

>>> The current approach uses plain spinlocks, which are more efficient. Note
>>> that both, appending as well as removing from the queue are writing
>>> operations, from the point of view of the queue. So I don't think LWLocks
>>> buy you anything here, either.
>>
>> I agree that this might not be useful.  We don't really have all the
>> message types defined yet, though, so it's hard to say.
>
> What does the type of lock used have to do with message types? IMO it
> doesn't matter what kind of message or what size you want to send. For
> appending or removing a pointer to or from a message queue, a spinlock seems
> to be just the right thing to use.

Well, it's certainly nice, if you can make it work. I haven't really
thought about all the cases, though. The main advantages of LWLocks
are that you can take them in either shared or exclusive mode, and that
you can hold them for more than a handful of instructions. If we're
trying to design a really *simple* system for message passing, LWLocks
might be just right. Take the lock, read or write the message,
release the lock. But it seems like that's not really the case we're
trying to optimize for, so this may be a dead-end.

>> You probably need this, but 8KB seems like a pretty small chunk size.
>
> For node-internal messaging, I probably agree. Would need benchmarking, as
> it's a compromise between latency and overhead, IMO.
>
> I've chosen 8KB so these messages (together with some GCS and other
> transport headers) presumably fit into ethernet jumbo frames. I'd argue that
> you'd want even smaller chunk sizes for 1500 byte MTUs, because I don't
> expect the GCS to do a better job at fragmenting, than we can do in the
> upper layer (i.e. without copying data and w/o additional latency when
> reassembling the packet). But again, maybe that should be benchmarked,
> first.

Yeah, probably. I think designing something that works efficiently
over a network is a somewhat different problem than designing
something that works on an individual node, and we probably shouldn't
let the designs influence each other too much.

>> There's no padding or sophisticated allocation needed.  You
>> just need a pointer to the last byte read (P1), the last byte allowed
>> to be read (P2), and the last byte allocated (P3).  Writers take a
>> spinlock, advance P3, release the spinlock, write the message, take
>> the spinlock, advance P2, release the spinlock, and signal the reader.
>
> That would block parallel writers (i.e. only one process can write to the
> queue at any time).

I feel like there's probably some variant of this idea that works
around that problem. The problem is that when a worker finishes
writing a message, he needs to know whether to advance P2 only over
his own message or also over some subsequent message that has been
fully written in the meantime. I don't know exactly how to solve that
problem off the top of my head, but it seems like it might be
possible.

>> Readers take the spinlock, read P1 and P2, release the spinlock, read
>> the data, take the spinlock, advance P1, and release the spinlock.
>
> It would require copying data in case a process only needs to forward the
> message. That's a quick pointer dequeue and enqueue exercise ATM.

If we need to do that, that's a compelling argument for having a
single messaging area rather than one per backend. But I'm not sure I
see why we would need that sort of capability. Why wouldn't you just
arrange for the sender to deliver the message directly to the final
recipient?

>> You might still want to fragment chunks of data to avoid problems if,
>> say, two writers are streaming data to a single reader.  In that case,
>> if the messages were too large compared to the amount of buffer space
>> available, you might get poor utilization, or even starvation.  But I
>> would think you wouldn't need to worry about that until the message
>> size got fairly high.
>
> Some of the writers in Postgres-R allocate the chunk for the message in
> shared memory way before they send the message. I.e. during a write
> operation of a transaction that needs to be replicated, the backend
> allocates space for a message at the start of the operation, but only fills
> it with change set data during processing. That can possibly take quite a
> while.

So, they know in advance how large the message will be but not what
the contents will be? What are they doing?

>> I think unicast messaging is really useful and I really want it, but
>> the requirement that it be done through dynamic shared memory
>> allocations feels very uncomfortable to me (as you've no doubt
>> gathered).
>
> Well, I on the other hand am utterly uncomfortable with having a separate
> solution for memory allocation per sub-system (and it definitely is an
> inherent problem to lots of our subsystems). Given the ubiquity of dynamic
> memory allocators, I don't really understand your discomfort.

Well, the fact that something is commonly used doesn't mean it's right
for us. Tabula rasa, we might design the whole system differently,
but changing it now is not to be undertaken lightly. Hopefully the
above comments shed some light on my concerns. In short, (1) I don't
want to preallocate a big chunk of memory we might not use, (2) I fear
reducing the overall robustness of the system, and (3) I'm uncertain
what other systems would be able to leverage a dynamic allocator of the
sort you propose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
