Re: bg worker: patch 1 of 6 - permanent process

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date: 2010-08-26 21:57:28
Message-ID: AANLkTi=has6d24oUT6qc201vE46mie4P_Sn4tnX0RBcC@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 26, 2010 at 3:03 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
>> On the more general topic of imessages, I had one other thought that
>> might be worth considering.  Instead of using shared memory, what
>> about using a file that is shared between the sender and receiver?
>
> What would that buy us (at the price of more system calls and disk I/O)?
> Remember that the current approach (IIRC) uses exactly one syscall to send a
> message: kill() to send the (multiplexed) signal. (Except on strange
> platforms or setups that don't have a user-space spinlock implementation and
> need to use system mutexes).

It wouldn't require you to preallocate a big chunk of shared memory
without knowing how much of it you'll actually need. For example,
suppose we implement parallel query. If the message queues can be
allocated on the fly, then you can just say
maximum_message_queue_size_per_backend = 16MB and that'll probably be
good enough for most installations. On systems where parallel query
is not used (e.g. because they have only 1 or 2 processors), it
costs nothing. On systems where parallel query is used extensively
(e.g. because they have 32 processors), you'll allocate enough space
for the number of backends that actually need message buffers, and not
more than that. Furthermore, if parallel query is used at some times
(say, for nightly reporting) but not others (say, for daily OLTP
queries), the buffers can be deallocated when the helper backends exit
(or paged out if they are idle), and that memory can be reclaimed for
other use.

In addition, it means that maximum_message_queue_size_per_backend (or
whatever it's called) can be changed on-the-fly; that is, it can be
PGC_SIGHUP rather than PGC_POSTMASTER. Being able to change GUCs
without shutting down the postmaster is a *big deal* for people
running in 24x7 operations. Even things like wal_level that aren't
apt to be changed more than once in a blue moon are a problem (once
you go from "not having a standby" to "having a standby", you're
unlikely to want to go backwards), and this would likely need more
tweaking. You might find that you need more memory for better
throughput, or that you need to reclaim memory for other purposes.
That's especially true if it's a hard allocation made regardless of how
many backends actually need it, rather than something that backends can
allocate only as and when they need it.

As to efficiency, the process is not much different once the initial
setup is completed. Just because you write to a memory-mapped file
rather than a shared memory segment doesn't mean that you're
necessarily doing disk I/O. On systems that support it, you could
also choose to map a named POSIX shm rather than a disk file. Either
way, there might be a little more overhead at startup but that doesn't
seem so bad; presumably the amount of work that the worker is doing is
large compared to the overhead of a few system calls, or you're
probably in trouble anyway, since our process startup overhead is
pretty substantial already. The only time it seems like the overhead
would be annoying is if a process is going to use this system, but
only lightly. Doing the extra setup just to send one or two messages
might suck. But maybe that just means this isn't the right mechanism
for those cases (e.g. the existing XID-wraparound logic should still
use signal multiplexing rather than this system). I see the value of
this as being primarily for streaming big chunks of data, not so much
for sending individual, very short messages.
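
To make the named-POSIX-shm option concrete, here is a minimal sketch of
how a backend could map such a segment on demand. The function name, the
object name, and the (minimal) error handling are illustrative only, not
from any patch; on Linux you would link with -lrt:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Map (creating it if necessary) a named POSIX shared memory object of the
 * given size, returning its address or NULL on failure.
 */
static void *
map_message_queue(const char *name, size_t size)
{
    int     fd;
    void   *addr;

    fd = shm_open(name, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) size) < 0)
    {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    return (addr == MAP_FAILED) ? NULL : addr;
}

/*
 * e.g.  void *q = map_message_queue("/pg_imessages_4711", 16 * 1024 * 1024);
 * The last process to detach would munmap() and shm_unlink() the name,
 * giving the memory back to the system.
 */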

>> On
>> the other hand, for processes that only send and receive messages
>> occasionally, this might just be overkill (and overhead).  You'd be
>> just as well off wrapping the access to the file in an LWLock: the
>> reader takes the lock, reads the data, marks it read, and releases the
>> lock.  The writer takes the lock, writes data, and releases the lock.
>
> The current approach uses plain spinlocks, which are more efficient. Note
> that both appending to and removing from the queue are writing
> operations from the queue's point of view. So I don't think LWLocks
> buy you anything here, either.

I agree that this might not be useful. We don't really have all the
message types defined yet, though, so it's hard to say.

> I understand the need to limit the amount of data in flight, but I don't
> think that sending any type of message should ever block. Messages are
> atomic in that regard. Either they are ready to be delivered (in entirety)
> or not. Thus the sender needs to hold back the message if the recipient is
> overloaded. (Also note that currently imessages are bound to a maximum size
> of around 8 KB).

That's functionally equivalent to blocking, isn't it? I think that's
just a question of what API you want to expose.

> It might be interesting to note that I've just implemented some kind of
> streaming mechanism *atop* of imessages for Postgres-R. A data stream gets
> fragmented into single messages. As you pointed out, there should be some
> kind of congestion control. However, in my case, that needs to cover the
> inter-node connection as well, not just imessages. So I think the solution
> to that problem needs to be found on a higher level. I.e. in the Postgres-R
> case, I want to limit the *overall* amount of recovery data that's pending
> for a certain node, not just the amount that's pending on a certain stream
> within the imessages system.

For replication, that might be the case, but for parallel query,
per-queue seems about right. At any rate, no design we've discussed
will let individual queues grow without bound.

> Think of imessages as the IP between processes, while streaming of data
> needs something akin to TCP on top of it. (OTOH, this comparison is lacking,
> because imessages guarantee reliable and ordered delivery of messages).

You probably need this, but 8KB seems like a pretty small chunk size.
I think one of the advantages of a per-backend area is that you don't
need to worry so much about fragmentation. If you only need in-order
message delivery, you can just use the whole thing as a big ring
buffer. There's no padding or sophisticated allocation needed. You
just need a pointer to the last byte read (P1), the last byte allowed
to be read (P2), and the last byte allocated (P3). Writers take a
spinlock, advance P3, release the spinlock, write the message, take
the spinlock, advance P2, release the spinlock, and signal the reader.
Readers take the spinlock, read P1 and P2, release the spinlock, read
the data, take the spinlock, advance P1, and release the spinlock.
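
To make that concrete, here is a rough sketch of the protocol in C. It
assumes a single writer per buffer, uses pthread spinlocks in place of
PostgreSQL's slock_t, handles wraparound naively with modular indexing,
and all of the names are made up for illustration:

#include <pthread.h>
#include <stdint.h>

#define RING_SIZE 65536         /* size of one per-backend buffer, say */

typedef struct
{
    pthread_spinlock_t lock;    /* pthread_spin_init(..., PTHREAD_PROCESS_SHARED) */
    uint64_t    p1;             /* last byte read */
    uint64_t    p2;             /* last byte allowed to be read */
    uint64_t    p3;             /* last byte allocated */
    char        data[RING_SIZE];
} ringbuf;

/* Writer: reserve space under the lock, copy outside it, then commit. */
static int
ring_write(ringbuf *rb, const char *msg, uint64_t len)
{
    uint64_t    start;
    uint64_t    i;

    pthread_spin_lock(&rb->lock);
    if (rb->p3 + len - rb->p1 > RING_SIZE)
    {
        pthread_spin_unlock(&rb->lock);
        return -1;              /* no room: hold the message back */
    }
    start = rb->p3;
    rb->p3 += len;              /* advance P3: the bytes are reserved */
    pthread_spin_unlock(&rb->lock);

    for (i = 0; i < len; i++)
        rb->data[(start + i) % RING_SIZE] = msg[i];

    pthread_spin_lock(&rb->lock);
    rb->p2 = start + len;       /* advance P2: the message is readable */
    pthread_spin_unlock(&rb->lock);
    /* ... signal the reader here, e.g. via the multiplexed signal ... */
    return 0;
}

/* Reader: snapshot P1/P2 under the lock, copy outside it, then consume. */
static uint64_t
ring_read(ringbuf *rb, char *out, uint64_t maxlen)
{
    uint64_t    start;
    uint64_t    avail;
    uint64_t    i;

    pthread_spin_lock(&rb->lock);
    start = rb->p1;
    avail = rb->p2 - rb->p1;
    pthread_spin_unlock(&rb->lock);

    if (avail > maxlen)
        avail = maxlen;
    for (i = 0; i < avail; i++)
        out[i] = rb->data[(start + i) % RING_SIZE];

    pthread_spin_lock(&rb->lock);
    rb->p1 = start + avail;     /* advance P1: the space can be reused */
    pthread_spin_unlock(&rb->lock);
    return avail;
}

With more than one writer per buffer, P2 could only be advanced once every
earlier reservation had been committed, so a real implementation would need
a bit more bookkeeping at that step.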

You might still want to fragment chunks of data to avoid problems if,
say, two writers are streaming data to a single reader. In that case,
if the messages were too large compared to the amount of buffer space
available, you might get poor utilization, or even starvation. But I
would think you wouldn't need to worry about that until the message
size got fairly high.
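
Building on the sketch above, fragmenting would just mean capping the size
of each individual write. MAX_CHUNK and stream_write() are again made up,
and a real implementation would also prefix each chunk with a sender id and
length so the reader can demultiplex interleaved streams:

#define MAX_CHUNK 8192          /* illustrative cap on a single message */

/*
 * Split a large payload into bounded chunks so that one stream cannot
 * monopolize the reader's buffer.  Uses ring_write() from the sketch above.
 */
static int
stream_write(ringbuf *rb, const char *buf, uint64_t total)
{
    uint64_t    sent = 0;

    while (sent < total)
    {
        uint64_t    n = total - sent;

        if (n > MAX_CHUNK)
            n = MAX_CHUNK;
        if (ring_write(rb, buf + sent, n) < 0)
            return -1;          /* buffer full: back off and retry later */
        sent += n;
    }
    return 0;
}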

> BTW: why do you think the data heavy messages are sensitive to concurrency
> problems? I found the control messages to be rather more sensitive, as state
> changes and timing for those control messages are trickier to deal with.

Well, what I was thinking about is the fact that data messages are
bigger. If I'm writing a 16-byte message once a minute and the reader
and I block each other until the message is fully read or written,
it's not really that big of a deal. If the same thing happens when
we're trying to continuously stream tuple data from one process to
another, it halves the throughput, since we expect both processes to be
reading and writing almost constantly.

>> So I kind of wonder whether we ought to have two separate systems, one
>> for data and one for control, with somewhat different characteristics.
>> I notice that one of your bg worker patches is for OOO-messages.  I
>> apologize again for not having read through it, but how much does that
>> resemble separating the control and data channels?
>
> It's something that resides within the coordinator process exclusively and
> doesn't have much to do with imessages.

Oh, OK.

> As is evident, all of these decisions are rather Postgres-R centric.
> However, I still think the simplicity and the level of generalization of
> imessages, dynamic shared memory and to some extent even the background
> worker infrastructure make these components potentially re-usable.

I think unicast messaging is really useful and I really want it, but
the requirement that it be done through dynamic shared memory
allocations feels very uncomfortable to me (as you've no doubt
gathered).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
