Re: bg worker: patch 1 of 6 - permanent process

From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date: 2010-08-27 18:17:40
Message-ID: 4C780144.8080407@bluegap.ch
Lists: pgsql-hackers

Hi,

On 08/26/2010 11:57 PM, Robert Haas wrote:
> It wouldn't require you to preallocate a big chunk of shared memory

Agreed, you wouldn't have to allocate it in advance. We would still want
a configurable upper limit. So this can be seen as another approach for
implementing a dynamic allocator. (Which should be separate from the
exact imessages implementation anyway, if only for the sake of
modularization, IMO).

> In addition, it means that maximum_message_queue_size_per_backend (or
> whatever it's called) can be changed on-the-fly; that is, it can be
> PGC_SIGHUP rather than PGC_POSTMASTER.

That's certainly a point. However, as you are proposing a solution to
just one subsystem (i.e. imessages), I don't find it half as convincing.

If you are saying it *should* be possible to resize shared memory in a
portable way, why not do it for *all* subsystems right away? I still
remember Tom saying it's not something that's doable in a portable way.
Why and how should it be possible on a per-backend basis? How portable
is mmap() really? Why don't we use it in Postgres as of now?

I certainly think that these are orthogonal issues: whether to use fixed
boundaries or to dynamically allocate the available memory is one thing,
dynamic resizing is another. If the latter is possible, I'm certainly not
opposed to it. (But would still favor dynamic allocation).

> As to efficiency, the process is not much different once the initial
> setup is completed.

I fully agree to that.

I'm more concerned about ease of use for developers. Simply being able
to alloc() from shared memory makes things easier than having to invent
a separate allocation method for every subsystem, again and again
(similar to the argument that people are more used to multi-threaded
programming).
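
To illustrate what I mean by ease of use: the kind of interface I'd want
subsystems to be able to rely on is about this simple (function names
made up for this mail, not necessarily what the patch calls them):

    /* hypothetical dynamic shared memory allocator interface */
    extern void *shmem_dyn_alloc(Size size);    /* NULL if the configured
                                                 * upper limit is exhausted */
    extern void shmem_dyn_free(void *ptr);

    /* a subsystem then simply does: */
    MyMessage  *msg;

    msg = (MyMessage *) shmem_dyn_alloc(sizeof(MyMessage) + payload_len);
    if (msg == NULL)
        elog(ERROR, "out of dynamic shared memory");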

> Doing the extra setup just to send one or two messages
> might suck. But maybe that just means this isn't the right mechanism
> for those cases (e.g. the existing XID-wraparound logic should still
> use signal multiplexing rather than this system). I see the value of
> this as being primarily for streaming big chunks of data, not so much
> for sending individual, very short messages.

I agree that simple signals don't need a full imessage. But as soon as
you want to send some data (like which database to vacuum), or require
the delivery guarantee (i.e. no single message gets lost, as opposed to
signals), then imessages should be cheap enough.

>> The current approach uses plain spinlocks, which are more efficient. Note
>> that both, appending as well as removing from the queue are writing
>> operations, from the point of view of the queue. So I don't think LWLocks
>> buy you anything here, either.
>
> I agree that this might not be useful. We don't really have all the
> message types defined yet, though, so it's hard to say.

What does the type of lock used have to do with message types? IMO it
doesn't matter what kind of message or what size you want to send. For
appending or removing a pointer to or from a message queue, a spinlock
seems to be just the right thing to use.
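
To make that concrete, the two operations are roughly this (simplified;
struct and function names don't exactly match the patch):

    #include "postgres.h"
    #include "storage/spin.h"

    typedef struct IMessage
    {
        struct IMessage *next;  /* link within the recipient's queue */
        /* ... message type, sender, payload size, payload ... */
    } IMessage;

    typedef struct IMessageQueue
    {
        slock_t     lock;       /* protects head and tail */
        IMessage   *head;       /* oldest message, consumed first */
        IMessage   *tail;       /* newest message */
    } IMessageQueue;

    /* append a message: a handful of instructions under the spinlock */
    static void
    queue_append(IMessageQueue *queue, IMessage *msg)
    {
        msg->next = NULL;
        SpinLockAcquire(&queue->lock);
        if (queue->tail != NULL)
            queue->tail->next = msg;
        else
            queue->head = msg;
        queue->tail = msg;
        SpinLockRelease(&queue->lock);
    }

    /* remove the oldest message, or return NULL if the queue is empty */
    static IMessage *
    queue_remove(IMessageQueue *queue)
    {
        IMessage   *msg;

        SpinLockAcquire(&queue->lock);
        msg = queue->head;
        if (msg != NULL)
        {
            queue->head = msg->next;
            if (queue->head == NULL)
                queue->tail = NULL;
        }
        SpinLockRelease(&queue->lock);
        return msg;
    }

Neither the message type nor its size ever shows up here; the spinlock
only ever covers a couple of pointer updates.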

>> I understand the need to limit the amount of data in flight, but I don't
>> think that sending any type of message should ever block. Messages are
>> atomic in that regard. Either they are ready to be delivered (in entirety)
>> or not. Thus the sender needs to hold back the message, if the recipient is
>> overloaded. (Also note that currently imessages are bound to a maximum size
>> of around 8 KB).
>
> That's functionally equivalent to blocking, isn't it? I think that's
> just a question of what API you want to expose.

Hm.. well, yeah, depends on what level you are arguing. The imessages
API can be used in a completely non-blocking fashion. So a process can
theoretically do other work while waiting for messages.
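Roughly like this (function names are placeholders, not necessarily the
patch's):

    for (;;)
    {
        IMessage   *msg = imessage_check();     /* returns NULL if the
                                                 * queue is empty, never
                                                 * blocks */

        if (msg != NULL)
            process_message(msg);
        else
            do_other_work();    /* or sleep until signalled */
    }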

For parallel querying, the helper/worker backends would probably need to
block if the origin backend is not ready to accept more data, yes.
However, making them accept and process another job in the meantime seems
hard to do. But that's not an imessages problem per se. (With the
streaming layer I mentioned above, it would not be possible at all,
because that layer blocks.)

> For replication, that might be the case, but for parallel query,
> per-queue seems about right. At any rate, no design we've discussed
> will let individual queues grow without bound.

Extend parallel querying to multiple nodes and you are back at the same
requirement.

However, it's certainly something that can be done atop imessages. I'm
unsure if doing it as part of imessages is a good thing or not. Given
the above requirement, I don't currently think so. Using multiple queues
with different priorities, as you proposed, would probably make it more
feasible.

> You probably need this, but 8KB seems like a pretty small chunk size.

For node-internal messaging, I probably agree. Would need benchmarking,
as it's a compromise between latency and overhead, IMO.

I've chosen 8KB so that these messages (together with some GCS and other
transport headers) presumably fit into Ethernet jumbo frames. I'd argue
that you'd want even smaller chunk sizes for 1500 byte MTUs, because I
don't expect the GCS to do a better job at fragmenting than we can do in
the upper layer (i.e. without copying data and without additional latency
when reassembling the packet). But again, maybe that should be
benchmarked first.
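
(For illustration, assuming the usual 9000 byte jumbo frame MTU: 8192
bytes of payload leave roughly 800 bytes of headroom for GCS, TCP/IP and
Ethernet headers. On a standard 1500 byte MTU, the analogous chunk size
would end up somewhere around 1400 bytes.)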

> I think one of the advantages of a per-backend area is that you don't
> need to worry so much about fragmentation. If you only need in-order
> message delivery, you can just use the whole thing as a big ring
> buffer.

Hm.. interesting idea. It's similar to my initial implementation, except
that I had only a single ring-buffer for all backends.

> There's no padding or sophisticated allocation needed. You
> just need a pointer to the last byte read (P1), the last byte allowed
> to be read (P2), and the last byte allocated (P3). Writers take a
> spinlock, advance P3, release the spinlock, write the message, take
> the spinlock, advance P2, release the spinlock, and signal the reader.

That would block parallel writers (i.e. only one process can write to
the queue at any time).

> Readers take the spinlock, read P1 and P2, release the spinlock, read
> the data, take the spinlock, advance P1, and release the spinlock.

It would require copying data in case a process only needs to forward
the message. That's a quick pointer dequeue and enqueue exercise ATM.
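
Just to make sure we're reading this the same way, here's a minimal
single-writer/single-reader sketch of the scheme you describe (ignoring
wrap-around and buffer-full handling):

    #include "postgres.h"
    #include "storage/spin.h"

    #define RING_BUFFER_SIZE (64 * 1024)    /* arbitrary for this sketch */

    typedef struct RingBuffer
    {
        slock_t     lock;
        uint64      p1;         /* last byte read */
        uint64      p2;         /* last byte allowed to be read */
        uint64      p3;         /* last byte allocated */
        char        data[RING_BUFFER_SIZE];
    } RingBuffer;

    static void
    ring_write(RingBuffer *rb, const char *msg, Size msg_len)
    {
        uint64      offset;

        SpinLockAcquire(&rb->lock);
        offset = rb->p3;
        rb->p3 += msg_len;              /* allocate space in the ring */
        SpinLockRelease(&rb->lock);

        /* copy the message outside the lock */
        memcpy(rb->data + offset % RING_BUFFER_SIZE, msg, msg_len);

        SpinLockAcquire(&rb->lock);
        rb->p2 = offset + msg_len;      /* make it visible to the reader */
        SpinLockRelease(&rb->lock);

        /* ... signal the reader ... */
    }

    static Size
    ring_read(RingBuffer *rb, char *dst)
    {
        uint64      start,
                    end;

        SpinLockAcquire(&rb->lock);
        start = rb->p1;
        end = rb->p2;
        SpinLockRelease(&rb->lock);

        memcpy(dst, rb->data + start % RING_BUFFER_SIZE, end - start);

        SpinLockAcquire(&rb->lock);
        rb->p1 = end;                   /* free the space for reuse */
        SpinLockRelease(&rb->lock);

        return (Size) (end - start);
    }

The memcpy on the way in and on the way out is exactly the copying I'm
referring to above.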

> You might still want to fragment chunks of data to avoid problems if,
> say, two writers are streaming data to a single reader. In that case,
> if the messages were too large compared to the amount of buffer space
> available, you might get poor utilization, or even starvation. But I
> would think you wouldn't need to worry about that until the message
> size got fairly high.

Some of the writers in Postgres-R allocate the chunk for the message in
shared memory way before they send the message. I.e. during a write
operation of a transaction that needs to be replicated, the backend
allocates space for a message at the start of the operation, but only
fills it with change set data during processing. That can possibly take
quite a while.

Decoupling memory allocation from message queue management makes it
possible to do this without having to copy the data. The same holds true
for forwarding a message.
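
In code, the pattern is roughly this (fragments only, reusing the
hypothetical names from the sketches above):

    /* at the start of the to-be-replicated operation: reserve the space */
    msg = (IMessage *) shmem_dyn_alloc(sizeof(IMessage) + MAX_CHANGE_SET_SIZE);

    /* ... the operation runs, filling in change set data as it goes ... */

    /* only once the message is complete does the queue get involved */
    queue_append(&recipient->queue, msg);

    /* and forwarding is just moving the pointer to another queue, no memcpy */
    msg = queue_remove(&my_queue);
    if (msg != NULL)
        queue_append(&next_hop->queue, msg);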

> Well, what I was thinking about is the fact that data messages are
> bigger. If I'm writing a 16-byte message once a minute and the reader
> and I block each other until the message is fully read or written,
> it's not really that big of a deal. If the same thing happens when
> we're trying to continuously stream tuple data from one process to
> another, it halves the throughput; we expect both processes to be
> reading/writing almost constantly.

Agreed. Unlike the proposed ring-buffer approach, the separate allocator
approach doesn't have that problem, because writing itself is fully
parallelized, even to the same recipient.

> I think unicast messaging is really useful and I really want it, but
> the requirement that it be done through dynamic shared memory
> allocations feels very uncomfortable to me (as you've no doubt
> gathered).

Well, I on the other hand am utterly uncomfortable with having a
separate solution for memory allocation per subsystem (and it definitely
is a problem inherent to lots of our subsystems). Given the ubiquity of
dynamic memory allocators, I don't really understand your discomfort.

Thanks for discussing, I always enjoy respectful disagreement.

Regards

Markus Wanner
