Re: [HACKERS] [POC] Faster processing at Gather node

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [POC] Faster processing at Gather node
Date: 2018-01-25 17:09:23
Message-ID: CA+Tgmob=kOVeaw3v-OCRPbmtK1qHhu3uvHV6ANxoB2xZuir_=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jan 9, 2018 at 7:09 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> + * mq_sender and mq_bytes_written can only be changed by the sender.
>> + * mq_receiver and mq_sender are protected by mq_mutex, although, importantly,
>> + * they cannot change once set, and thus may be read without a lock once this
>> + * is known to be the case.
>
> I don't recall our conversation around this anymore, and haven't read
> down far enough to see the relevant code. Lest I forget: Such construct
> often need careful use of barriers.

I think the only thing the code assumes here is that if we previously
read the value with the spinlock and didn't get NULL, we can later
read the value without the spinlock and count on seeing the same value
we saw previously. I think that's safe enough.

> s/should/is/ or similar?

I prefer it the way that I have it.

> Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> I suspect there won't be much of a performance impact, but it's probably
> worth checking.

I don't think I understand your concern here. If this is used on a
system where we're emulating atomics and barriers in painful ways, it
might hurt performance, but I think we have a policy of not caring.

Also, I don't know where I'd find a 32-bit system to test.

>> - * Importantly, mq_ring can be safely read and written without a lock. Were
>> - * this not the case, we'd have to hold the spinlock for much longer
>> - * intervals, and performance might suffer. Fortunately, that's not
>> - * necessary. At any given time, the difference between mq_bytes_read and
>> + * At any given time, the difference between mq_bytes_read and
>
> Hm, why did you remove the first part about mq_ring itself?

Bad editing. Restored.

>> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>>
>> while (sent < nbytes)
>> {
>> - bool detached;
>> uint64 rb;
>> + uint64 wb;
>>
>> /* Compute number of ring buffer bytes used and available. */
>> - rb = shm_mq_get_bytes_read(mq, &detached);
>> - Assert(mq->mq_bytes_written >= rb);
>> - used = mq->mq_bytes_written - rb;
>> + rb = pg_atomic_read_u64(&mq->mq_bytes_read);
>> + wb = pg_atomic_read_u64(&mq->mq_bytes_written);
>> + Assert(wb >= rb);
>> + used = wb - rb;
>
> Just to make sure my understanding is correct: No barriers needed here
> because "bytes_written" is only written by the sending backend, and
> "bytes_read" cannot lap it. Correct?

We can't possibly read a stale value of mq_bytes_written because we
are the only process that can write it. It's possible that the
receiver has just increased mq_bytes_read and that the change isn't
visible to us yet, but if so, the sender's also going to set our
latch, or has done so already. So the worst thing that happens is
that we decide to sleep because it looks like no space is available
and almost immediately get woken up because there really is space.

>> Assert(used <= ringsize);
>> available = Min(ringsize - used, nbytes - sent);
>>
>> /* Bail out if the queue has been detached. */
>> - if (detached)
>> + if (mq->mq_detached)
>
> Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> the stack / register in the first iteration, and then not reread any
> further? I think it's fine because every branch of the if below ends up
> in a syscall / barrier, but it might be worth noting on that here.

Aargh. I hate compilers. I added a comment. Should I think about
changing mq_detached to pg_atomic_uint32 instead?

> Perhaps mention that this could lead to spuriously signalling the wrong
> backend in case of detach, but that that is fine?

I think that's a general risk of latches that doesn't need to be
specifically recapitulated here.

> I know the scheme isn't new, but I do find it not immediately obvious
> that 'wb' is short for 'bytes_written'.

Sorry.

>> - /* Write as much data as we can via a single memcpy(). */
>> + /*
>> + * Write as much data as we can via a single memcpy(). Make sure
>> + * these writes happen after the read of mq_bytes_read, above.
>> + * This barrier pairs with the one in shm_mq_inc_bytes_read.
>> + */
>
> s/above/above. Otherwise a newer mq_bytes_read could become visible
> before the corresponding reads have fully finished./?

I don't find that very clear. A newer mq_bytes_read could become
visible whenever, and a barrier doesn't prevent that from happening.
What it does is ensure (together with the one in
shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
fully *written* yet.

Generally, my mental model is that barriers make things happen in
program order rather than some other order that the CPU thinks would
be fun. Imagine a single-core server doing all of this stuff the "old
school" way. If the reader puts data into the queue before
advertising its presence and the writer finishes using the data from
the queue before advertising its consumption, then everything works.
If you do anything else, it's flat busted, even on that single-core
system, because a context switch could happen at any time, and then
you might read data that isn't written yet. The barrier just ensures
that we get that order of execution even on fancy modern hardware, but
I'm not sure how much of that we really need to explain here.

> Could you also add a comment as to why you think a read barrier isn't
> sufficient? IIUC that's the case because we need to prevent reordering
> in both directions: Can't neither start reading based on a "too new"
> bytes_read, nor can affort writes to mq_ring being reordered to before
> the barrier. Correct?

I can't parse that statement. We're separating from the read of
mqh_read_bytes from the write to mqh_ring. My understanding is that a
read barrier can separate two reads, a write barrier can separate two
writes, and a full barrier is needed to separate a write from a read
in either order. Added a comment to that effect.

>> + pg_memory_barrier();
>> memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
>> (char *) data + sent, sendnow);
>> sent += sendnow;
>
> Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
> proper padding/union in the struct to make it unnecessary to do that bit
> of offset calculation every time? I think it currently prevents
> efficient address calculation instructions from being used.

Well, the root cause -- aside from me being a fallible human being
with only limited programing skills -- is that I wanted the parallel
query code to be able to request whatever queue size it preferred
without having to worry about how many bytes of that space was going
to get consumed by overhead. But it would certainly be possible to
change it up, if somebody felt like working out how the API should be
set up. I don't really want to do that right now, though.

>> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
>> From: Robert Haas <rhaas(at)postgresql(dot)org>
>> Date: Sat, 4 Nov 2017 19:03:03 +0100
>> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1
>
>> - /* Consume any zero-copy data from previous receive operation. */
>> - if (mqh->mqh_consume_pending > 0)
>> + /*
>> + * If we've consumed an amount of data greater than 1/4th of the ring
>> + * size, mark it consumed in shared memory. We try to avoid doing this
>> + * unnecessarily when only a small amount of data has been consumed,
>> + * because SetLatch() is fairly expensive and we don't want to do it
>> + * too often.
>> + */
>> + if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
>> {
>
> Hm. Why are we doing this at the level of updating the variables, rather
> than SetLatch calls?

Hmm, I'm not sure I understand what you're suggesting, here. In
general, we return with the data for the current message unconsumed,
and then consume it the next time the function is called, so that
(except when the message wraps the end of the buffer) we can return a
pointer directly into the buffer rather than having to memcpy(). What
this patch does is postpone consuming the data further, either until
we can free up at least a quarter of the ring buffer or until we need
to wait for more data. It seemed worthwhile to free up space in the
ring buffer occasionally even if we weren't to the point of waiting
yet, so that the sender has an opportunity to write new data into that
space if it happens to still be running.

Slightly revised patches attached. 0002 is unchanged except for being
made pgindent-clean.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-shm-mq-less-spinlocks-v3.patch application/octet-stream 14.2 KB
0002-shm-mq-reduce-receiver-latch-set-v2.patch application/octet-stream 7.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Luke Cowell 2018-01-25 17:26:02 Re: Possible performance regression with pg_dump of a large number of relations
Previous Message Tom Lane 2018-01-25 17:02:59 Re: Further cleanup of pg_dump/pg_restore item selection code