Re: [HACKERS] [POC] Faster processing at Gather node

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [POC] Faster processing at Gather node
Date: 2018-01-10 00:09:34
Message-ID: 20180110000934.lft7ebeqek4grrlx@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-12-04 10:50:53 -0500, Robert Haas wrote:
> Subject: [PATCH 1/2] shm-mq-less-spinlocks-v2

> + * mq_sender and mq_bytes_written can only be changed by the sender.
> + * mq_receiver and mq_sender are protected by mq_mutex, although, importantly,
> + * they cannot change once set, and thus may be read without a lock once this
> + * is known to be the case.

I don't recall our conversation around this anymore, and haven't read
down far enough to see the relevant code. Lest I forget: such constructs
often need careful use of barriers.

> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it. Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex. Instead,
> + * they are written atomically using 8 byte loads and stores. Memory barriers
> + * must be carefully used to synchronize reads and writes of these values with
> + * reads and writes of the actual data in mq_ring.
> + *
> + * mq_detached needs no locking. It can be set by either the sender or the
> + * receiver, but only ever from false to true, so redundant writes don't
> + * matter. It is important that if we set mq_detached and then set the
> + * counterparty's latch, the counterparty must be certain to see the change
> + * after waking up. Since SetLatch begins with a memory barrier and ResetLatch
> + * ends with one, this should be OK.

s/should/is/ or similar?

Perhaps a short benchmark for 32-bit systems using shm_mq wouldn't hurt?
I suspect there won't be much of a performance impact, but it's probably
worth checking.

> * mq_ring_size and mq_ring_offset never change after initialization, and
> * can therefore be read without the lock.
> *
> - * Importantly, mq_ring can be safely read and written without a lock. Were
> - * this not the case, we'd have to hold the spinlock for much longer
> - * intervals, and performance might suffer. Fortunately, that's not
> - * necessary. At any given time, the difference between mq_bytes_read and
> + * At any given time, the difference between mq_bytes_read and

Hm, why did you remove the first part about mq_ring itself?

> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>
> while (sent < nbytes)
> {
> - bool detached;
> uint64 rb;
> + uint64 wb;
>
> /* Compute number of ring buffer bytes used and available. */
> - rb = shm_mq_get_bytes_read(mq, &detached);
> - Assert(mq->mq_bytes_written >= rb);
> - used = mq->mq_bytes_written - rb;
> + rb = pg_atomic_read_u64(&mq->mq_bytes_read);
> + wb = pg_atomic_read_u64(&mq->mq_bytes_written);
> + Assert(wb >= rb);
> + used = wb - rb;

Just to make sure my understanding is correct: No barriers needed here
because "bytes_written" is only written by the sending backend, and
"bytes_read" cannot lap it. Correct?
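To spell out why the arithmetic itself is safe: both counters only ever
increase, so the difference is meaningful even after the logical ring
positions wrap. A toy model (field and function names are mine, not
shm_mq's):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the shm_mq counters: both are monotonically increasing
 * byte counts, so "used" is simply their difference, even once the
 * logical positions have wrapped around the ring many times. */
typedef struct
{
    uint64_t bytes_read;        /* advanced only by the receiver */
    uint64_t bytes_written;     /* advanced only by the sender */
    size_t   ring_size;
} toy_mq;

/* Bytes currently sitting in the ring. */
static size_t
toy_mq_used(const toy_mq *mq)
{
    assert(mq->bytes_written >= mq->bytes_read);
    return (size_t) (mq->bytes_written - mq->bytes_read);
}

/* Free space currently available to the sender. */
static size_t
toy_mq_available(const toy_mq *mq)
{
    size_t used = toy_mq_used(mq);

    assert(used <= mq->ring_size);
    return mq->ring_size - used;
}
```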

> Assert(used <= ringsize);
> available = Min(ringsize - used, nbytes - sent);
>
> /* Bail out if the queue has been detached. */
> - if (detached)
> + if (mq->mq_detached)

Hm, do all paths here guarantee that mq->mq_detached won't be read once
into a register on the first iteration and then never reread? I think
it's fine because every branch of the if below ends up in a syscall /
barrier, but it might be worth noting that here.
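To illustrate the hoisting concern with a toy (C11 atomics standing in
for PostgreSQL's barriers; none of this is the actual shm_mq code): a
plain non-atomic flag read in a loop may legally be hoisted out by the
compiler, whereas an atomic load, or an intervening barrier-implying
call as in the latch paths, forces a fresh read each iteration:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>
#include <unistd.h>

/* Flag set by a "counterparty" thread, mirroring mq_detached. */
static atomic_bool detached;

static void *
detach_soon(void *arg)
{
    (void) arg;
    usleep(1000);               /* let the poller spin a little */
    atomic_store(&detached, true);
    return NULL;
}

/* Polls until the detach is observed.  Because the load is atomic, the
 * compiler must re-perform it every iteration; with a plain bool and no
 * barrier in the loop body, caching it in a register would be legal and
 * this loop could spin forever. */
long
poll_until_detached(void)
{
    long        polls = 0;
    pthread_t   thr;

    pthread_create(&thr, NULL, detach_soon, NULL);
    while (!atomic_load(&detached))
        polls++;
    pthread_join(thr, NULL);
    return polls;
}
```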

> + /*
> + * Since mq->mqh_counterparty_attached is known to be true at this
> + * point, mq_receiver has been set, and it can't change once set.
> + * Therefore, we can read it without acquiring the spinlock.
> + */
> + Assert(mqh->mqh_counterparty_attached);
> + SetLatch(&mq->mq_receiver->procLatch);

Perhaps mention that this could lead to spuriously signalling the wrong
backend in case of detach, but that that is fine?

> /* Skip manipulation of our latch if nowait = true. */
> if (nowait)
> @@ -934,10 +953,18 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
> }
> else
> {
> - Size offset = mq->mq_bytes_written % (uint64) ringsize;
> - Size sendnow = Min(available, ringsize - offset);
> + Size offset;
> + Size sendnow;
> +
> + offset = wb % (uint64) ringsize;
> + sendnow = Min(available, ringsize - offset);

I know the scheme isn't new, but I do find it not immediately obvious
that 'wb' is short for 'bytes_written'.

> - /* Write as much data as we can via a single memcpy(). */
> + /*
> + * Write as much data as we can via a single memcpy(). Make sure
> + * these writes happen after the read of mq_bytes_read, above.
> + * This barrier pairs with the one in shm_mq_inc_bytes_read.
> + */

s/above/above. Otherwise a newer mq_bytes_read could become visible
before the corresponding reads have fully finished./?

Could you also add a comment as to why you think a read barrier isn't
sufficient? IIUC that's the case because we need to prevent reordering
in both directions: we can neither start reading based on a "too new"
bytes_read, nor afford writes to mq_ring being reordered to before the
barrier. Correct?
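Roughly the pairing I understand this to be, as a miniature C11 sketch
with atomic_thread_fence standing in for pg_memory_barrier (everything
here is illustrative, not the patch's code): the sender's full fence
keeps its ring writes from moving before its read of bytes_read, and
the receiver's full fence keeps its ring reads from drifting past its
bump of bytes_read:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

#define RING_SIZE 8

static char ring[RING_SIZE];
static _Atomic uint64_t bytes_read;
static _Atomic uint64_t bytes_written;

void
toy_send_byte(char c)
{
    uint64_t    rb = atomic_load(&bytes_read);
    uint64_t    wb = atomic_load(&bytes_written);

    (void) rb;                  /* a real sender would check for space */

    /* Full fence: don't write into the ring until the read of
     * bytes_read (telling us the slot is free) has fully happened.
     * Pairs with the fence in toy_recv_byte. */
    atomic_thread_fence(memory_order_seq_cst);
    ring[wb % RING_SIZE] = c;
    atomic_store(&bytes_written, wb + 1);
}

char
toy_recv_byte(void)
{
    uint64_t    rb = atomic_load(&bytes_read);
    char        c = ring[rb % RING_SIZE];

    /* Full fence: finish reading the data before publishing the new
     * bytes_read, which lets the sender reuse the slot. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store(&bytes_read, rb + 1);
    return c;
}
```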

> + pg_memory_barrier();
> memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
> (char *) data + sent, sendnow);
> sent += sendnow;

Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
proper padding/union in the struct to make it unnecessary to do that bit
of offset calculation every time? I think it currently prevents
efficient address calculation instructions from being used.
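What I have in mind, roughly: pad the header so the ring itself starts
at a fixed, maximally aligned offset, at which point the address is just
ring + (wb % size) with no extra runtime offset. Layout below is purely
illustrative, not the actual shm_mq struct:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Header padded out to a cacheline-ish boundary so the flexible array
 * member 'ring' lands at a compile-time-known, aligned offset. */
typedef struct
{
    uint64_t    bytes_read;
    uint64_t    bytes_written;
    uint64_t    ring_size;
    char        pad[64 - 3 * sizeof(uint64_t)];
    char        ring[];         /* starts at offset 64, always */
} padded_mq;
```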

> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> From: Robert Haas <rhaas(at)postgresql(dot)org>
> Date: Sat, 4 Nov 2017 19:03:03 +0100
> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1

> - /* Consume any zero-copy data from previous receive operation. */
> - if (mqh->mqh_consume_pending > 0)
> + /*
> + * If we've consumed an amount of data greater than 1/4th of the ring
> + * size, mark it consumed in shared memory. We try to avoid doing this
> + * unnecessarily when only a small amount of data has been consumed,
> + * because SetLatch() is fairly expensive and we don't want to do it
> + * too often.
> + */
> + if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
> {

Hm. Why are we doing this at the level of updating the variables, rather
than SetLatch calls?
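For concreteness, the policy as I read it, reduced to a toy (names are
mine): the receiver accumulates consumed bytes locally and only
publishes them, and would only SetLatch, once more than a quarter of the
ring has been consumed:

```c
#include <stdint.h>
#include <assert.h>

typedef struct
{
    uint64_t    ring_size;
    uint64_t    consume_pending;    /* consumed locally, not yet published */
    uint64_t    published;          /* number of shared-memory updates */
} toy_receiver;

/* Returns 1 if this consumption crossed the 1/4-ring threshold and
 * triggered a (simulated) shared-memory update + latch set. */
int
toy_consume(toy_receiver *r, uint64_t nbytes)
{
    r->consume_pending += nbytes;
    if (r->consume_pending > r->ring_size / 4)
    {
        /* here: bump mq_bytes_read and SetLatch the sender */
        r->consume_pending = 0;
        r->published++;
        return 1;
    }
    return 0;
}
```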

Greetings,

Andres Freund
