Re: bg worker: patch 1 of 6 - permanent process

From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date: 2010-08-30 08:47:14
Message-ID: 4C7B7012.4050100@bluegap.ch
Lists: pgsql-hackers

Hi,

On 08/27/2010 10:46 PM, Robert Haas wrote:
> What other subsystems are you imagining servicing with a dynamic
> allocator? If there were a big demand for this functionality, we
> probably would have been forced to implement it already, but that's
> not the case. We've already discussed the fact that there are massive
> problems with using it for something like shared_buffers, which is by
> far the largest consumer of shared memory.

Understood. I certainly plan to look into that, to get a better
understanding of the problems it poses for dynamically allocated memory.

> I think it would be great if we could bring some more flexibility to
> our memory management. There are really two layers of problems here.

Full ACK.

> One is resizing the segment itself, and one is resizing structures
> within the segment. As far as I can tell, there is no portable API
> that can be used to resize the shm itself. For so long as that
> remains the case, I am of the opinion that any meaningful resizing of
> the objects within the shm is basically unworkable. So we need to
> solve that problem first.

Why should resizing the objects within the shmem be unworkable?
Don't my patches prove the exact opposite? Being able to resize
"objects" within the shm requires some kind of underlying dynamic
allocation. And I'd rather be in control of that allocator than have
to deal with two dozen different implementations across different
OSes and their libraries.
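
For illustration, here is a minimal sketch of what the interface of
such an in-segment allocator could look like. The names
(shm_alloc_init, shm_alloc, shm_free) are hypothetical placeholders,
not the actual API of my patches:

    #include <stddef.h>

    /*
     * Hypothetical interface of a dynamic allocator living inside the
     * single, fixed-size shared memory segment mapped at postmaster start.
     */

    /* Set up the allocator over a region carved out of the main segment. */
    extern void shm_alloc_init(void *region, size_t region_size);

    /* Allocate and free variable-size chunks; must be safe for
     * concurrent use by multiple backends. */
    extern void *shm_alloc(size_t size);
    extern void shm_free(void *chunk);

Any subsystem (imessages today, possibly shared_buffers or the
ProcArray later) would then draw from this one pool instead of
reserving its own fixed-size block at postmaster start.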

> There are a couple of possible solutions, which have been discussed
> here in the past.

I currently don't have much interest in dynamic resizing. Being able to
resize the overall amount of shared memory on the fly would be nice,
sure. But the total amount of RAM in a server changes rather
infrequently. Being able to use what's available more efficiently is
what I'm interested in. That doesn't need any kind of additional or
different OS level support. It's just a matter of making better use of
what's available - within Postgres itself.

> Next, we have to think about how we're going to resize data structures
> within this expandable shm.

Okay, that's where I'm getting interested.

> Many of these structures are not things
> that we can easily move without bringing the system to a halt. For
> example, it's difficult to see how you could change the base address
> of shared buffers without ceasing all system activity, at which point
> there's not really much advantage over just forcing a restart.
> Similarly with LWLocks or the ProcArray.

I guess that's what Bruce wanted to point out by saying our data
structures are mostly "contiguous", i.e. not dynamic lists or hash
tables, but plain, simple arrays.

Maybe that's a subjective impression, but I seem to hear complaints
about their fixed size and inflexibility quite often. Try to imagine the
flexibility that dynamic lists could give us.
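
Just as a toy example of that flexibility (using the hypothetical
shm_alloc from the sketch above): a subsystem's entry could be a list
node allocated on demand, instead of a slot in an array sized at
postmaster start.

    typedef struct WorkerEntry
    {
        struct WorkerEntry *next;   /* simple shared, singly-linked list */
        int         pid;            /* whatever per-entry state is needed */
    } WorkerEntry;

    static WorkerEntry *
    add_worker_entry(WorkerEntry **head, int pid)
    {
        /* caller is assumed to hold the lock protecting the list */
        WorkerEntry *entry = (WorkerEntry *) shm_alloc(sizeof(WorkerEntry));

        if (entry == NULL)
            return NULL;            /* pool exhausted */
        entry->pid = pid;
        entry->next = *head;
        *head = entry;
        return entry;
    }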

> And if you can't move them,
> then how will you grow them if (as will likely be the case) there's
> something immediately following them in memory. One possible solution
> is to divide up these data structures into "slabs". For example, we
> might imagine allocating shared_buffers in 1GB chunks.

Why 1GB, and why do yet another layer of dynamic allocation within
that? The buffers are (by default) 8K, so allocate in chunks of 8K, or
a tiny bit more to cover the book-keeping stuff.

> To make this
> work, we'd need to change the memory layout so that each chunk would
> include all of the miscellaneous stuff that we need to do bookkeeping
> for that chunk, such as the LWLocks and buffer descriptors. That
> doesn't seem completely impossible, but there would be some
> performance penalty, because you could no longer index into shared
> buffers from a single base offset.

AFAICT we currently have four fixed-size blocks to manage shared
buffers: the buffer blocks themselves, the buffer descriptors, the
strategy status (for the freelist) and the buffer lookup table.

It's not obvious to me why these data structures should perform better
than a dynamically allocated layout. One could rather argue that
combining (some of) the bookkeeping stuff with the data itself would
lead to better locality and thus perform better.
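
To make that locality point a bit more concrete, here is a hedged
sketch of what a per-buffer allocation could look like if descriptor
and page were kept together in one chunk (the struct and function
names are illustrative, not the existing buffer manager layout):

    #include "postgres.h"
    #include "storage/buf_internals.h"      /* BufferDesc */

    /*
     * Illustrative only: one allocation per buffer, keeping the descriptor
     * and its 8K page adjacent, instead of separate fixed-size arrays.
     */
    typedef struct DynBuffer
    {
        BufferDesc  desc;           /* bookkeeping: tag, flags, refcount, ... */
        char        page[BLCKSZ];   /* the 8K data block itself */
    } DynBuffer;

    static DynBuffer *
    alloc_buffer(void)
    {
        /* shm_alloc() is the hypothetical in-segment allocator from above */
        return (DynBuffer *) shm_alloc(sizeof(DynBuffer));
    }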

> Instead, you'd need to determine
> which chunk contains the buffer you want, look up the base address for
> that chunk, and then index into the chunk. Maybe that overhead
> wouldn't be significant (or maybe it would); at any rate, it's not
> completely free. There's also the problem of handling the partial
> chunk at the end, especially if that happens to be the only chunk.

This sounds way too complicated, yes. Use 8K chunks and most of the
problems vanish.

> I think the problems for other arrays are similar, or more severe. I
> can't see, for example, how you could resize the ProcArray using this
> approach.

Try not to think in terms of resizing, but of dynamic allocation.
Being able to resize the ProcArray (and thus to alter max_connections
on the fly) would take a lot more work.

Just using the unoccupied space of the ProcArray for other subsystems
that need it more urgently could be done much more easily. Again, you'd
want to allocate a single PGPROC at a time.

(And yes, the benefits aren't as significant as for shared_buffers,
simply because PGPROC doesn't occupy that much memory.)
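
A minimal sketch for that, again assuming the hypothetical
shm_alloc/shm_free: each PGPROC is handed out when a backend starts
and returned to the common pool when it exits, so unused connection
slots don't pin memory that other subsystems could use.

    #include "postgres.h"
    #include "storage/proc.h"       /* PGPROC */

    static PGPROC *
    alloc_proc(void)
    {
        /* one PGPROC at a time, taken from the shared pool */
        return (PGPROC *) shm_alloc(sizeof(PGPROC));
    }

    static void
    free_proc(PGPROC *proc)
    {
        /* give the space back for other subsystems to use */
        shm_free(proc);
    }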

> If you want to deallocate a chunk of shared buffers, it's
> not impossible to imagine an algorithm for relocating any dirty
> buffers in the segment to be deallocated into the remaining available
> space, and then chucking the ones that are not dirty.

Please use the dynamic allocator for that; don't duplicate that logic
again. Those allocators are designed to allocate small chunks
efficiently, down to a few bytes.

> It might not be
> real cheap, but that's not the same thing as not possible. On the
> other hand, changing the backend ID of a process in flight seems
> intractable. Maybe it's not. Or maybe there is some other approach
> to resizing these data structures that can work, but it's not real
> clear to me what it is.

Changing to a dynamically allocated memory model certainly requires some
thought and lots of work. Yes. It's not for free.

> So basically my feeling is that reworking our memory allocation in
> general, while possibly worthwhile, is a whole lot of work.

Exactly.

> If we
> focus on getting imessages done in the most direct fashion possible,
> it seems like the sort of things that could get done in six months to
> a year.

Well, it works for Postgres-R as it is, so imessages already exists
without a single additional month of work. And I don't intend to change
it back to something that can't use a dynamic allocator; I already ran
into too many problems that way, see below.

> If we take the approach of reworking our whole approach to
> memory allocation first, I think it will take several years. Assuming
> the problems discussed above aren't totally intractable, I'd be in
> favor of solving them, because I think we can get some collateral
> benefits out of it that would be nice to have. However, it's
> definitely a much larger project.

Agreed.

> If the allocations are
> per-backend and can be made on the fly, that problem goes away.

That might hold true for imessages, which simply lose their importance
once the (recipient) backend vanishes. But for other shared memory
stuff, per-backend areas would rather complicate shared memory access.

> As long as we keep the shared memory area used for imessages/dynamic
> allocation separate from, and independent of, the main shm, we can
> still gain many of the same advantages - in particular, not PANICing
> if a remap fails, and being able to resize the thing on the fly.

Separate sub-system allocators mean separate code, separate bugs, and
lots more work. Please, no. KISS.

> However, I believe that the implementation will be more complex if the
> area is not per-backend. Resizing is almost certainly a necessity in
> this case, for the reasons discussed above

I disagree, and see the main benefit in making better use of the
available resources. Resizing will lose a lot of its importance once
you can dynamically adjust the boundaries between subsystems' use of
the single, huge, fixed-size shmem chunk allocated at start.

> and that will have to be
> done by having all backends unmap and remap the area in a coordinated
> fashion,

That's assuming resizing capability.

> so it will be more disruptive than unmapping and remapping a
> message queue for a single backend, where you only need to worry about
> the readers and writers for that particular queue.

And that's assuming a separate allocation method for the imessages
sub-system.

> Also, you now have
> to worry about fragmentation: a simple ring buffer is great if you're
> processing messages on a FIFO basis, but when you have multiple
> streams of messages with different destinations, it's probably not a
> great solution.

Exactly, that's where dynamic allocation shows its real advantages. No
silly ring buffers required.

> This goes back to my points further up: what else do you think this
> could be used for? I'm much less optimistic about this being reusable
> than you are, and I'd like to hear some concrete examples of other use
> cases.

Sure, and well understood. I'll take a try at converting
shared_buffers.

> Well, it's certainly nice, if you can make it work. I haven't really
> thought about all the cases, though. The main advantages of LWLocks
> is that you can take them in either shared or exclusive mode

As mentioned, the message queue only ever sees write accesses (enqueue
and dequeue), so the shared mode is just unneeded overhead.

> and that
> you can hold them for more than a handful of instructions.

Neither of the two operations needs more than a handful of instructions,
so that's plain overhead as well.

> If we're
> trying to design a really *simple* system for message passing, LWLocks
> might be just right. Take the lock, read or write the message,
> release the lock.

That's exactly how easy it is *with* the dynamic allocator: take the
(even simpler) spinlock, enqueue (or dequeue) the message, and release
the lock again.

No locking is required for writing or reading the message itself.
Independent (and multi-process capable and safe) alloc and free
routines handle the memory management; they get called *before*
writing the message and *after* reading it.

Mixing memory allocation into queue management is a lot more
complicated to design and understand, and less efficient.
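
A minimal sketch of that flow, assuming the hypothetical
shm_alloc/shm_free from above and PostgreSQL's spinlock primitives;
the queue layout and function names are illustrative, not the actual
imessages code:

    #include "postgres.h"
    #include "storage/spin.h"       /* slock_t, SpinLockAcquire/Release */

    typedef struct IMessage
    {
        struct IMessage *next;
        Size        payload_size;
        /* payload bytes follow the header */
    } IMessage;

    typedef struct IMessageQueue
    {
        slock_t     lock;
        IMessage   *head;
        IMessage   *tail;
    } IMessageQueue;

    /*
     * Sender: shm_alloc() and fill the message without holding any lock,
     * then enqueue it with just a few instructions under the spinlock.
     */
    static void
    imessage_enqueue(IMessageQueue *queue, IMessage *msg)
    {
        msg->next = NULL;
        SpinLockAcquire(&queue->lock);
        if (queue->tail != NULL)
            queue->tail->next = msg;
        else
            queue->head = msg;
        queue->tail = msg;
        SpinLockRelease(&queue->lock);
    }

    /*
     * Receiver: dequeue under the spinlock, then read the payload and
     * shm_free() the message afterwards, again without holding the lock.
     */
    static IMessage *
    imessage_dequeue(IMessageQueue *queue)
    {
        IMessage   *msg;

        SpinLockAcquire(&queue->lock);
        msg = queue->head;
        if (msg != NULL)
        {
            queue->head = msg->next;
            if (queue->head == NULL)
                queue->tail = NULL;
        }
        SpinLockRelease(&queue->lock);
        return msg;
    }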

> But it seems like that's not really the case we're
> trying to optimize for, so this may be a dead-end.
>
>>> You probably need this, but 8KB seems like a pretty small chunk size.
>>
>> For node-internal messaging, I probably agree. Would need benchmarking, as
>> it's a compromise between latency and overhead, IMO.
>>
>> I've chosen 8KB so these messages (together with some GCS and other
>> transport headers) presumably fit into ethernet jumbo frames. I'd argue that
>> you'd want even smaller chunk sizes for 1500 byte MTUs, because I don't
>> expect the GCS to do a better job at fragmenting, than we can do in the
>> upper layer (i.e. without copying data and w/o additional latency when
>> reassembling the packet). But again, maybe that should be benchmarked,
>> first.
>
> Yeah, probably. I think designing something that works efficiently
> over a network is a somewhat different problem than designing
> something that works on an individual node, and we probably shouldn't
> let the designs influence each other too much.
>
>>> There's no padding or sophisticated allocation needed. You
>>> just need a pointer to the last byte read (P1), the last byte allowed
>>> to be read (P2), and the last byte allocated (P3). Writers take a
>>> spinlock, advance P3, release the spinlock, write the message, take
>>> the spinlock, advance P2, release the spinlock, and signal the reader.
>>
>> That would block parallel writers (i.e. only one process can write to the
>> queue at any time).
>
> I feel like there's probably some variant of this idea that works
> around that problem. The problem is that when a worker finishes
> writing a message, he needs to know whether to advance P2 only over
> his own message or also over some subsequent message that has been
> fully written in the meantime. I don't know exactly how to solve that
> problem off the top of my head, but it seems like it might be
> possible.
>
>>> Readers take the spinlock, read P1 and P2, release the spinlock, read
>>> the data, take the spinlock, advance P1, and release the spinlock.
>>
>> It would require copying data in case a process only needs to forward the
>> message. That's a quick pointer dequeue and enqueue exercise ATM.
>
> If we need to do that, that's a compelling argument for having a
> single messaging area rather than one per backend. But I'm not sure I
> see why we would need that sort of capability. Why wouldn't you just
> arrange for the sender to deliver the message directly to the final
> recipient?
>
>>> You might still want to fragment chunks of data to avoid problems if,
>>> say, two writers are streaming data to a single reader. In that case,
>>> if the messages were too large compared to the amount of buffer space
>>> available, you might get poor utilization, or even starvation. But I
>>> would think you wouldn't need to worry about that until the message
>>> size got fairly high.
>>
>> Some of the writers in Postgres-R allocate the chunk for the message in
>> shared memory way before they send the message. I.e. during a write
>> operation of a transaction that needs to be replicated, the backend
>> allocates space for a message at the start of the operation, but only fills
>> it with change set data during processing. That can possibly take quite a
>> while.
>
> So, they know in advance how large the message will be but not what
> the contents will be? What are they doing?
>
>>> I think unicast messaging is really useful and I really want it, but
>>> the requirement that it be done through dynamic shared memory
>>> allocations feels very uncomfortable to me (as you've no doubt
>>> gathered).
>>
>> Well, I on the other hand am utterly uncomfortable with having a separate
>> solution for memory allocation per sub-system (and it definitely is an
>> inherent problem to lots of our subsystems). Given the ubiquity of dynamic
>> memory allocators, I don't really understand your discomfort.
>
> Well, the fact that something is commonly used doesn't mean it's right
> for us. Tabula raza, we might design the whole system differently,
> but changing it now is not to be undertaken lightly. Hopefully the
> above comments shed some light on my concerns. In short, (1) I don't
> want to preallocate a big chunk of memory we might not use, (2) I fear
> reducing the overall robustness of the system, and (3) I'm uncertain
> what other systems would be able leverage a dynamic allocator of the
> sort you propose.
>
