Re: [POC] Faster processing at Gather node

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [POC] Faster processing at Gather node
Date: 2017-11-04 16:55:33
Message-ID: 20171104165533.yahkmgxjqyrnvdvr@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-11-04 16:38:31 +0530, Robert Haas wrote:
> On hydra (PPC), these changes didn't help much. Timings:
>
> master: 29605.582, 29753.417, 30160.485
> patch: 28218.396, 27986.951, 26465.584
>
> That's about a 5-6% improvement. On my MacBook, though, the
> improvement was quite a bit more:

Hm. I wonder why that is. Random unverified theories (this plane doesn't
have power supplies for us mere mortals in coach, therefore I'm not
going to run benchmarks):

- Due to the lower per-core performance the leader backend is so
bottlenecked that there's just not a whole lot of
contention. Therefore removing the lock doesn't help much. That might
be a bit different if the redundant projection is removed.
- IO performance on hydra is notoriously bad, so the workers might simply
  not get data fast enough to produce rows quickly enough to cause
  contention.

> master: 21436.745, 20978.355, 19918.617
> patch: 15896.573, 15880.652, 15967.176
>
> Median-to-median, that's about a 24% improvement.

Neat!

> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it. Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex. Instead,
> + * they are written atomically using 8 byte loads and stores. Memory barriers
> + * must be carefully used to synchronize reads and writes of these values with
> + * reads and writes of the actual data in mq_ring.

Maybe mention that there's a fallback for ancient platforms?
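
Something like the following, added after the "8 byte loads and stores"
sentence, perhaps (just a wording sketch - the point being that the
pg_atomic_*_u64() operations are emulated where there's no native 64-bit
atomics support):

     * (On platforms lacking native 64-bit atomics, the pg_atomic_*_u64()
     * operations fall back to an emulated implementation, so this does
     * not require native support.)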

> @@ -900,15 +921,12 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
> }
> else if (available == 0)
> {
> - shm_mq_result res;
> -
> - /* Let the receiver know that we need them to read some data. */
> - res = shm_mq_notify_receiver(mq);
> - if (res != SHM_MQ_SUCCESS)
> - {
> - *bytes_written = sent;
> - return res;
> - }
> + /*
> + * Since mq->mqh_counterparty_attached is known to be true at this
> + * point, mq_receiver has been set, and it can't change once set.
> + * Therefore, we can read it without acquiring the spinlock.
> + */
> + SetLatch(&mq->mq_receiver->procLatch);

Might not hurt to assert mqh_counterparty_attached, just for slightly
easier debugging.
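
I.e. something like this (assuming the flag lives in the shm_mq_handle,
as the mqh_ prefix suggests):

    /* The receiver must have attached by now, or we shouldn't be here. */
    Assert(mqh->mqh_counterparty_attached);
    SetLatch(&mq->mq_receiver->procLatch);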

> @@ -983,19 +1009,27 @@ shm_mq_receive_bytes(shm_mq *mq, Size bytes_needed, bool nowait,
> for (;;)
> {
> Size offset;
> - bool detached;
> + uint64 read;
>
> /* Get bytes written, so we can compute what's available to read. */
> - written = shm_mq_get_bytes_written(mq, &detached);
> - used = written - mq->mq_bytes_read;
> + written = pg_atomic_read_u64(&mq->mq_bytes_written);
> + read = pg_atomic_read_u64(&mq->mq_bytes_read);

Theoretically we don't actually need to re-read this from shared memory;
we could keep that information in local memory too, right?  Doubtful,
however, that it's important enough to bother, given that we have to pull
in the cacheline for `mq_bytes_written` anyway, and will later also dirty
it to *update* `mq_bytes_read`. Similarly on the write side.
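
To illustrate, with a hypothetical mqh_bytes_read_cached field in the
handle (untested, and assumes the function can see the handle; the local
copy would be advanced wherever shm_mq_inc_bytes_read() advances the
shared counter):

    /*
     * The receiver is the sole writer of mq_bytes_read, so a locally
     * tracked copy can't go stale; no shared-memory read needed here.
     */
    written = pg_atomic_read_u64(&mq->mq_bytes_written);
    read = mqh->mqh_bytes_read_cached;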

> -/*
> * Increment the number of bytes read.
> */
> static void
> @@ -1157,63 +1164,51 @@ shm_mq_inc_bytes_read(volatile shm_mq *mq, Size n)
> {
> PGPROC *sender;
>
> - SpinLockAcquire(&mq->mq_mutex);
> - mq->mq_bytes_read += n;
> + /*
> + * Separate prior reads of mq_ring from the increment of mq_bytes_read
> + * which follows. Pairs with the full barrier in shm_mq_send_bytes().
> + * We only need a read barrier here because the increment of mq_bytes_read
> + * is actually a read followed by a dependent write.
> + */
> + pg_read_barrier();
> +
> + /*
> + * There's no need to use pg_atomic_fetch_add_u64 here, because nobody
> + * else can be changing this value. This method avoids taking the bus
> + * lock unnecessarily.
> + */

s/the bus lock/a bus lock/? Might also be worth rephrasing away from bus
locks entirely - there are a lot of different ways atomics are
implemented, and not all of them involve a bus lock.
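
E.g. something like this, keeping the logic but wording it in terms of
avoiding the read-modify-write (sketch):

    /*
     * We are the only process that ever writes mq_bytes_read, so a plain
     * atomic read followed by an atomic write suffices; no need for the
     * potentially much more expensive atomic read-modify-write that
     * pg_atomic_fetch_add_u64() would perform.
     */
    pg_atomic_write_u64(&mq->mq_bytes_read,
                        pg_atomic_read_u64(&mq->mq_bytes_read) + n);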

> /*
> - * Get the number of bytes written. The sender need not use this to access
> - * the count of bytes written, but the receiver must.
> - */
> -static uint64
> -shm_mq_get_bytes_written(volatile shm_mq *mq, bool *detached)
> -{
> - uint64 v;
> -
> - SpinLockAcquire(&mq->mq_mutex);
> - v = mq->mq_bytes_written;
> - *detached = mq->mq_detached;
> - SpinLockRelease(&mq->mq_mutex);
> -
> - return v;
> -}

You still reference this function in a comment elsewhere, even though
it's being removed:

> + /*
> + * Separate prior reads of mq_ring from the write of mq_bytes_written
> + * which we're about to do. Pairs with shm_mq_get_bytes_written's read
> + * barrier.
> + */
> + pg_write_barrier();
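
Presumably that comment should now point at the barrier on the receive
side instead, e.g. (assuming the matching read barrier ends up in
shm_mq_receive_bytes()):

    /*
     * Separate prior reads of mq_ring from the write of mq_bytes_written
     * which we're about to do.  Pairs with the read barrier in
     * shm_mq_receive_bytes().
     */
    pg_write_barrier();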

Greetings,

Andres Freund
