From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: bg worker: patch 1 of 6 - permanent process
Date: 2010-08-30 09:06:02
Message-ID: 4C7B747A.8090502@bluegap.ch
Lists: pgsql-hackers

(Sorry, I need to disable Ctrl-Return, which quite often sends mails
earlier than I really want... continuing my mail.)

On 08/27/2010 10:46 PM, Robert Haas wrote:
> Yeah, probably. I think designing something that works efficiently
> over a network is a somewhat different problem than designing
> something that works on an individual node, and we probably shouldn't
> let the designs influence each other too much.

Agreed. Thus I've left out any kind of congestion avoidance stuff from
imessages so far.

>>> There's no padding or sophisticated allocation needed. You
>>> just need a pointer to the last byte read (P1), the last byte allowed
>>> to be read (P2), and the last byte allocated (P3). Writers take a
>>> spinlock, advance P3, release the spinlock, write the message, take
>>> the spinlock, advance P2, release the spinlock, and signal the reader.
>>
>> That would block parallel writers (i.e. only one process can write to the
>> queue at any time).
>
> I feel like there's probably some variant of this idea that works
> around that problem. The problem is that when a worker finishes
> writing a message, he needs to know whether to advance P2 only over
> his own message or also over some subsequent message that has been
> fully written in the meantime. I don't know exactly how to solve that
> problem off the top of my head, but it seems like it might be
> possible.

I've tried pretty much that before, and failed, because the allocation
order (i.e. the time the message gets created in preparation for writing
to it) isn't necessarily the same as the sending order (i.e. when the
process has finished writing and decides to send the message).

To satisfy the FIFO property w.r.t. the sending order, you need to
decouple allocation from the ordering (i.e. the queuing logic).

(And yes, it took me a while to figure out what was wrong in
Postgres-R before I even noticed that design bug.)
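To illustrate what I mean by decoupling, here's a minimal sketch (with
made-up names; imsg_create, imsg_send and shmem_alloc are placeholders,
not the actual imessages API): allocation only hands out space in shared
memory without implying any order, and it's the send step that links the
message into the receiver's FIFO, so the queue order reflects sending
order.

#include "postgres.h"
#include "storage/spin.h"

/* placeholder for the dynamic shared memory allocator */
extern void *shmem_alloc(Size size);

typedef struct IMessage
{
    struct IMessage *next;      /* link in the receiver's queue */
    Size             size;      /* payload size */
    char             data[1];   /* variable-length payload */
} IMessage;

typedef struct IMessageQueue
{
    slock_t     lock;           /* protects head and tail */
    IMessage   *head;           /* dequeue end */
    IMessage   *tail;           /* enqueue end */
} IMessageQueue;

/* step 1: allocate space; no queue position is assigned yet */
static IMessage *
imsg_create(Size payload_size)
{
    IMessage   *msg = shmem_alloc(offsetof(IMessage, data) + payload_size);

    msg->next = NULL;
    msg->size = payload_size;
    return msg;
}

/* step 2: only now does the message get its place in the FIFO */
static void
imsg_send(IMessageQueue *queue, IMessage *msg)
{
    SpinLockAcquire(&queue->lock);
    if (queue->tail != NULL)
        queue->tail->next = msg;
    else
        queue->head = msg;
    queue->tail = msg;
    SpinLockRelease(&queue->lock);

    /* signal the receiving backend here */
}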

>>> Readers take the spinlock, read P1 and P2, release the spinlock, read
>>> the data, take the spinlock, advance P1, and release the spinlock.
>>
>> It would require copying data in case a process only needs to forward the
>> message. That's a quick pointer dequeue and enqueue exercise ATM.
>
> If we need to do that, that's a compelling argument for having a
> single messaging area rather than one per backend.

Absolutely, yes.

> But I'm not sure I
> see why we would need that sort of capability. Why wouldn't you just
> arrange for the sender to deliver the message directly to the final
> recipient?

A process can read and even change the data of the message before
forwarding it, which is something the coordinator in Postgres-R does
sometimes (as it is the interface to the GCS and thus to the rest of
the nodes in the cluster).

For parallel querying (on a single node), that's probably a less
important feature.
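Roughly, with a single messaging area, forwarding stays a pointer
exercise (continuing the sketch above; again hypothetical names, not
the real code):

/* The coordinator can dequeue a message, inspect or rewrite its
 * payload in place, and re-enqueue the very same message for the
 * final recipient - no copying of the payload involved.
 */
static void
imsg_forward(IMessageQueue *from, IMessageQueue *to)
{
    IMessage   *msg;

    SpinLockAcquire(&from->lock);
    msg = from->head;
    if (msg != NULL)
    {
        from->head = msg->next;
        if (from->head == NULL)
            from->tail = NULL;
        msg->next = NULL;
    }
    SpinLockRelease(&from->lock);

    if (msg == NULL)
        return;

    /* ... read or modify msg->data here, as the coordinator does ... */

    imsg_send(to, msg);
}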

> So, they know in advance how large the message will be but not what
> the contents will be? What are they doing?

Filling the message until it's (mostly) full and then continuing with
the next one. At least that's how the streaming approach on top of
imessages works.

But yes, it's somewhat annoying to have to know the message size in
advance. I haven't implemented realloc so far, nor can I think of any
other solution. Note that the separation of allocation and queue
ordering is required anyway, for the reasons above.
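For what it's worth, the fixed-chunk case of that streaming layer looks
roughly like this (a simplified sketch on top of the imsg_create and
imsg_send placeholders above; IMSG_STREAM_CHUNK is an assumed constant,
not taken from the patch):

#define IMSG_STREAM_CHUNK 8192

/* Split a byte stream into messages whose size is chosen up front;
 * once a message is full it is sent and the next one is started.
 */
static void
imsg_stream_write(IMessageQueue *queue, const char *data, Size len)
{
    while (len > 0)
    {
        Size        chunk = Min(len, IMSG_STREAM_CHUNK);
        IMessage   *msg = imsg_create(chunk);

        memcpy(msg->data, data, chunk);
        imsg_send(queue, msg);

        data += chunk;
        len -= chunk;
    }
}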

> Well, the fact that something is commonly used doesn't mean it's right
> for us. Tabula raza, we might design the whole system differently,
> but changing it now is not to be undertaken lightly. Hopefully the
> above comments shed some light on my concerns. In short, (1) I don't
> want to preallocate a big chunk of memory we might not use,

Isn't that exactly what we do now for lots of sub-systems, and what
I'd like to improve (i.e. reduce to a single big chunk)?

> (2) I fear
> reducing the overall robustness of the system, and

Well, that applies to pretty much every new feature you add.

> (3) I'm uncertain
> what other systems would be able leverage a dynamic allocator of the
> sort you propose.

Okay, it's up to me to show evidence (or at least a PoC).

Regards

Markus Wanner
