Re: bg worker: patch 1 of 6 - permanent process

From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date: 2010-08-26 19:03:03
Message-ID: 4C76BA67.6090907@bluegap.ch
Lists: pgsql-hackers

Robert,

On 08/26/2010 02:44 PM, Robert Haas wrote:
> I dunno. It was just a thought. I haven't actually looked at the
> code to see how much synergy there is. (Sorry, been really busy...)

No problem; I was just wondering whether there was a specific benefit
you had in mind.

> On the more general topic of imessages, I had one other thought that
> might be worth considering. Instead of using shared memory, what
> about using a file that is shared between the sender and receiver?

What would that buy us, at the price of more system calls and disk
I/O? Remember that the current approach (IIRC) uses exactly one syscall
to send a message: the kill() that delivers the (multiplexed) signal.
(Except on strange platforms or setups that don't have a user-space
spinlock implementation and need to fall back to system mutexes.)

> So
> for example, perhaps each receiver will read messages from a file
> called pg_messages/%d, where %d is the backend ID. And writers will
> write into that file. Perhaps both readers and writers mmap() the
> file, or perhaps there's a way to make it work with just read() and
> write(). If you actually mmap() the file, you could probably manage
> it in a fashion pretty similar to what you had in mind for wamalloc,
> or some other setup that minimizes locking.

That would still require proper locking, then. So I'm not seeing the
benefit.

> In particular, ISTM that
> if we want this to be usable for parallel query, we'll want to be able
> to have one process streaming data in while another process streams
> data out, with minimal interference between these two activities.

That's entirely possible with the current approach. About the only
limitation is that a receiver can only consume messages in the order
they entered the queue. But pretty much any backend can send messages
to any other backend concurrently.

(Well, except that I think there currently are bugs in wamalloc.)

> On
> the other hand, for processes that only send and receive messages
> occasionally, this might just be overkill (and overhead). You'd be
> just as well off wrapping the access to the file in an LWLock: the
> reader takes the lock, reads the data, marks it read, and releases the
> lock. The writer takes the lock, writes data, and releases the lock.

The current approach uses plain spinlocks, which are more efficient.
Note that both appending to and removing from the queue are write
operations, from the point of view of the queue. So I don't think
LWLocks buy you anything here, either.
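
For illustration, here is a minimal sketch of such a spinlock-protected
queue (plain C11 atomics standing in for PostgreSQL's s_lock; all names
invented). It shows that both ends modify the queue's links, so a
shared/exclusive lock could not admit them concurrently anyway:

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct Msg
    {
        struct Msg *next;       /* payload would follow in practice */
    } Msg;

    typedef struct MsgQueue
    {
        atomic_flag lock;       /* init with ATOMIC_FLAG_INIT */
        Msg        *head;
        Msg        *tail;
    } MsgQueue;

    static void
    spin_lock(atomic_flag *l)
    {
        while (atomic_flag_test_and_set(l))
            ;                   /* busy-wait; no syscall on this path */
    }

    static void
    spin_unlock(atomic_flag *l)
    {
        atomic_flag_clear(l);
    }

    /* Appending modifies the tail: a write operation on the queue. */
    void
    queue_append(MsgQueue *q, Msg *m)
    {
        m->next = NULL;
        spin_lock(&q->lock);
        if (q->tail)
            q->tail->next = m;
        else
            q->head = m;
        q->tail = m;
        spin_unlock(&q->lock);
    }

    /* Removing modifies the head: equally a write operation. */
    Msg *
    queue_remove(MsgQueue *q)
    {
        Msg        *m;

        spin_lock(&q->lock);
        m = q->head;
        if (m)
        {
            q->head = m->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        spin_unlock(&q->lock);
        return m;
    }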

> It almost seems to me that there are two different kinds of messages
> here: control messages and data messages. Control messages are things
> like "vacuum this database!" or "flush your cache!" or "execute this
> query and send the results to backend %d!" or "cancel the currently
> executing query!". They are relatively small (in some cases,
> fixed-size), relatively low-volume, don't need complex locking, and
> can generally be processed serially but with high priority. Data
> messages are streams of tuples, either from a remote database from
> which we are replicating, or between backends that are executing a
> parallel query. These messages may be very large and extremely
> high-volume, are very sensitive to concurrency problems, but are not
> high-priority. We want to process them as quickly as possible, of
> course, but the work may get interrupted by control messages. Another
> point is that it's reasonable, at least in the case of parallel query,
> for the action of sending a data message to *block*. If one part of
> the query is too far ahead of the rest of the query, we don't want to
> queue up results forever, perhaps using CPU or I/O resources that some
> other backend needs to catch up, exhausting available disk space, etc.

I agree that such a thing isn't currently covered, and it might be
useful. However, adding two separate queues with different priorities
would be very simple to do. (Note, however, that the standard Unix
signals already exist for very simple kinds of control messages; for
example, to abort a parallel query, you could simply send SIGINT to all
background workers involved.)
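
A prioritized receive on top of two such queues would be about as
trivial as this (reusing queue_remove() from the sketch above; again
just an illustration, not existing code):

    Msg *
    queue_receive_prioritized(MsgQueue *control, MsgQueue *data)
    {
        Msg        *m = queue_remove(control);  /* control first, always */

        if (m == NULL)
            m = queue_remove(data);             /* then bulk data */
        return m;
    }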

I understand the need to limit the amount of data in flight, but I
don't think that sending any type of message should ever block.
Messages are atomic in that regard: either they are ready to be
delivered (in their entirety) or not. Thus the sender needs to hold
back the message if the recipient is overloaded. (Also note that
imessages are currently bound to a maximum size of around 8 KB.)
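
In code, that non-blocking contract might look like this (hypothetical
names; queue_try_append() is an assumed helper, not an existing
function):

    #include <stdbool.h>
    #include <stddef.h>

    #define IMSG_MAX_SIZE 8192  /* imessages are bounded, roughly 8 KB */

    /* Assumed helper: append unless the recipient's queue lacks space. */
    extern bool queue_try_append(MsgQueue *q, Msg *m, size_t len);

    bool
    imessage_try_send(MsgQueue *q, Msg *m, size_t len)
    {
        if (len > IMSG_MAX_SIZE)
            return false;       /* oversized: the sender must fragment */

        /* false means "recipient congested": the caller keeps the
         * message and retries later; nothing ever blocks in here. */
        return queue_try_append(q, m, len);
    }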

It might be interesting to note that I've just implemented a kind of
streaming mechanism *atop* imessages for Postgres-R: a data stream gets
fragmented into individual messages. As you pointed out, there should
be some kind of congestion control. However, in my case it needs to
cover the inter-node connection as well, not just imessages, so I think
the solution to that problem needs to be found at a higher level. I.e.
in the Postgres-R case, I want to limit the *overall* amount of
recovery data that's pending for a certain node, not just the amount
that's pending on a certain stream within the imessages system.
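
A rough sketch of that fragmentation, reusing IMSG_MAX_SIZE from above
(send_fragment() is again an assumed helper that wraps one fragment
into an imessage and defers to congestion control when the queue is
full):

    #include <stdint.h>

    #define FRAG_HEADER     (sizeof(uint32_t) * 2 + sizeof(size_t))
    #define FRAG_PAYLOAD    (IMSG_MAX_SIZE - FRAG_HEADER)

    extern void send_fragment(MsgQueue *q, uint32_t stream_id,
                              uint32_t seqno, const char *data, size_t len);

    void
    stream_send(MsgQueue *q, uint32_t stream_id, const char *buf,
                size_t total)
    {
        uint32_t    seqno = 0;
        size_t      off;

        /* each fragment carries a stream id and sequence number, so
         * the receiver can reassemble the stream in order */
        for (off = 0; off < total; off += FRAG_PAYLOAD, seqno++)
        {
            size_t      len = total - off;

            if (len > FRAG_PAYLOAD)
                len = FRAG_PAYLOAD;
            send_fragment(q, stream_id, seqno, buf + off, len);
        }
    }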

Think of imessages as the IP layer between processes, while streaming
data needs something akin to TCP on top of it. (OTOH, the comparison is
imperfect, because imessages already guarantee reliable and ordered
delivery of messages.)

BTW, why do you think the data-heavy messages are sensitive to
concurrency problems? I found the control messages to be rather more
sensitive, as state changes and timing for those are trickier to deal
with.

> So I kind of wonder whether we ought to have two separate systems, one
> for data and one for control, with somewhat different characteristics.
> I notice that one of your bg worker patches is for OOO-messages. I
> apologize again for not having read through it, but how much does that
> resemble separating the control and data channels?

It's something that resides exclusively within the coordinator process
and doesn't have much to do with imessages. Postgres-R doesn't require
the GCS to deliver (certain kinds of) messages in any particular order;
it only requires the GCS to guarantee reliable delivery (or
notification, in the form of excluding the failing node from the group,
in case delivery failed).

Thus the coordinator needs to be able to re-order the messages, because
the bg workers need to receive the change sets in the correct order,
and imessages guarantee to maintain that ordering.
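
For illustration only, such a reorder buffer could look like this
(hypothetical structure, not the actual patch; it reuses the queue
sketch from above and assumes the GCS never runs more than OOO_WINDOW
messages ahead of delivery):

    #include <stdint.h>

    #define OOO_WINDOW 64       /* assumed bound on out-of-order arrivals */

    typedef struct ReorderBuf
    {
        uint32_t    next_seqno;         /* next message owed to the worker */
        Msg        *slot[OOO_WINDOW];   /* slot[s % OOO_WINDOW] holds seqno s */
    } ReorderBuf;

    /* Stash a (possibly out-of-order) arrival, then hand every message
     * that is now contiguous to the bg worker, strictly by seqno. */
    void
    reorder_deliver(ReorderBuf *rb, uint32_t seqno, Msg *m,
                    MsgQueue *worker)
    {
        rb->slot[seqno % OOO_WINDOW] = m;

        while (rb->slot[rb->next_seqno % OOO_WINDOW] != NULL)
        {
            uint32_t    s = rb->next_seqno % OOO_WINDOW;

            queue_append(worker, rb->slot[s]);  /* in-order handoff */
            rb->slot[s] = NULL;
            rb->next_seqno++;
        }
    }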

The reasons for doing this within the coordinator are a) to lower the
requirements on the GCS and b) to gain more control over the data flow.
I.e. congestion control gets much easier if the coordinator knows the
amount of data that's queued (as opposed to having lots of TCP
connections, each of which queues an unknown amount of data).

As is evident, all of these decisions are rather Postgres-R centric.
However, I still think the simplicity and the level of generalization
of imessages, dynamic shared memory, and to some extent even the
background worker infrastructure make these components potentially
re-usable.

Regards

Markus Wanner
