Re: [HACKERS] [POC] Faster processing at Gather node

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [POC] Faster processing at Gather node
Date: 2018-02-07 18:41:37
Message-ID: 20180207184137.5mv4bipj3t7hbncv@alap3.anarazel.de

Hi,

On 2018-01-25 12:09:23 -0500, Robert Haas wrote:
> > Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> > I suspect there won't be much of a performance impact, but it's probably
> > worth checking.
>
> I don't think I understand your concern here. If this is used on a
> system where we're emulating atomics and barriers in painful ways, it
> might hurt performance, but I think we have a policy of not caring.

Well, it's more than just systems like that - for 64-bit atomics we
sometimes do fall back to spinlock-based atomics on 32-bit systems, even
if they support 32-bit atomics natively.

> Also, I don't know where I'd find a 32-bit system to test.

You can compile with -m32 on reasonable systems ;)

> >> Assert(used <= ringsize);
> >> available = Min(ringsize - used, nbytes - sent);
> >>
> >> /* Bail out if the queue has been detached. */
> >> - if (detached)
> >> + if (mq->mq_detached)
> >
> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> > the stack / register in the first iteration, and then not reread any
> > further? I think it's fine because every branch of the if below ends up
> > in a syscall / barrier, but it might be worth noting that here.
>
> Aargh. I hate compilers. I added a comment. Should I think about
> changing mq_detached to pg_atomic_uint32 instead?

I think a pg_compiler_barrier() would suffice to alleviate my concern,
right? If you wanted to go for an atomic, using pg_atomic_flag would
probably be more appropriate than pg_atomic_uint32.
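
To make that concrete, this is roughly all I had in mind (a made-up
loop, not the actual shm_mq_send_bytes(); the function name is
invented):

static shm_mq_result
wait_for_detach(shm_mq *mq)
{
    for (;;)
    {
        if (mq->mq_detached)        /* plain load, no atomics */
            return SHM_MQ_DETACHED;

        /* ... copy whatever fits, or wait on the latch ... */

        pg_compiler_barrier();      /* force mq_detached to be reread */
    }
}

No CPU-level fence is needed for that; the only requirement is that the
compiler can't keep the flag cached in a register across iterations, and
since every branch of the actual loop already ends in a syscall /
barrier, a comment may well be enough.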

> >> - /* Write as much data as we can via a single memcpy(). */
> >> + /*
> >> + * Write as much data as we can via a single memcpy(). Make sure
> >> + * these writes happen after the read of mq_bytes_read, above.
> >> + * This barrier pairs with the one in shm_mq_inc_bytes_read.
> >> + */
> >
> > s/above/above. Otherwise a newer mq_bytes_read could become visible
> > before the corresponding reads have fully finished./?
>
> I don't find that very clear. A newer mq_bytes_read could become
> visible whenever, and a barrier doesn't prevent that from happening.

Well, my point was that the barrier prevents the write to mq_bytes_read
from becoming visible before the corresponding reads have finished.
Otherwise the memcpy() could overwrite bytes the receiver is still
reading. And a barrier *does* prevent that from happening.

I don't think this is the same as:

> What it does is ensure (together with the one in
> shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
> fully *written* yet.

which seems much more about the barrier in shm_mq_inc_bytes_written()?
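
Spelled out, the two pairings I have in mind look roughly like this
(heavily trimmed sketch assuming the patch's pg_atomic_uint64 counters;
wraparound, latches and error handling omitted):

/* Pair 1: the sender must not overwrite bytes still being *read*. */

/* receiver, in shm_mq_inc_bytes_read(): */
pg_read_barrier();          /* finish all reads of mq_ring first */
pg_atomic_write_u64(&mq->mq_bytes_read,
                    pg_atomic_read_u64(&mq->mq_bytes_read) + n);

/* sender, in shm_mq_send_bytes(): */
used = pg_atomic_read_u64(&mq->mq_bytes_written) -
    pg_atomic_read_u64(&mq->mq_bytes_read);
pg_memory_barrier();        /* don't start the memcpy() before the read above */
memcpy(&mq->mq_ring[mq->mq_ring_offset + offset], data, sendnow);

/* Pair 2: the receiver must not see bytes that aren't fully *written*. */

/* sender, in shm_mq_inc_bytes_written(): */
pg_write_barrier();         /* finish all writes into mq_ring first */
pg_atomic_write_u64(&mq->mq_bytes_written,
                    pg_atomic_read_u64(&mq->mq_bytes_written) + n);

/* receiver, in shm_mq_receive_bytes(): */
written = pg_atomic_read_u64(&mq->mq_bytes_written);
pg_read_barrier();          /* don't touch mq_ring before the load above */
/* ... caller then reads the new payload out of mq_ring ... */

The comment in shm_mq_send_bytes() is about pair 1; your "don't read
bytes that aren't fully written yet" wording describes pair 2.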

> Generally, my mental model is that barriers make things happen in
> program order rather than some other order that the CPU thinks would
> be fun. Imagine a single-core server doing all of this stuff the "old
> school" way. If the reader puts data into the queue before
> advertising its presence and the writer finishes using the data from
> the queue before advertising its consumption, then everything works.
> If you do anything else, it's flat busted, even on that single-core
> system, because a context switch could happen at any time, and then
> you might read data that isn't written yet. The barrier just ensures
> that we get that order of execution even on fancy modern hardware, but
> I'm not sure how much of that we really need to explain here.

IDK, I find it nontrivial to understand individual uses of
barriers. There are often multiple non-equivalent ways to use barriers,
and the logic for why a specific one is correct isn't always obvious.

> >> + pg_memory_barrier();
> >> memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
> >> (char *) data + sent, sendnow);
> >> sent += sendnow;
> >
> > Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
> > proper padding/union in the struct to make it unnecessary to do that bit
> > of offset calculation every time? I think it currently prevents
> > efficient address calculation instructions from being used.
>
> Well, the root cause -- aside from me being a fallible human being
> with only limited programming skills -- is that I wanted the parallel
> query code to be able to request whatever queue size it preferred
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

What I meant is that instead of doing

struct shm_mq
{
    ...
    bool        mq_detached;
    uint8       mq_ring_offset;
    char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
};

it'd be possible to do something like

struct shm_mq
{
    ...
    bool        mq_detached;
    union
    {
        char        mq_ring[FLEXIBLE_ARRAY_MEMBER];
        double      forcealign;
    }           d;
};

which'd force the struct to be laid out so that mq_ring starts at a
suitably aligned offset. We use that trick in a bunch of places.
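
The immediate payoff is that the runtime offset term disappears from
every copy, i.e. roughly (using the union sketch above):

/* as the patch has it */
memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
       (char *) data + sent, sendnow);

/* with the union, no mq_ring_offset needed */
memcpy(&mq->d.mq_ring[offset],
       (char *) data + sent, sendnow);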

As far as I understand, that'd not run counter to your goal of:
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

right?

> change it up, if somebody felt like working out how the API should be
> set up. I don't really want to do that right now, though.

Right.

> >> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> >> From: Robert Haas <rhaas(at)postgresql(dot)org>
> >> Date: Sat, 4 Nov 2017 19:03:03 +0100
> >> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1
> >
> >> - /* Consume any zero-copy data from previous receive operation. */
> >> - if (mqh->mqh_consume_pending > 0)
> >> + /*
> >> + * If we've consumed an amount of data greater than 1/4th of the ring
> >> + * size, mark it consumed in shared memory. We try to avoid doing this
> >> + * unnecessarily when only a small amount of data has been consumed,
> >> + * because SetLatch() is fairly expensive and we don't want to do it
> >> + * too often.
> >> + */
> >> + if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
> >> {
> >
> > Hm. Why are we doing this at the level of updating the variables, rather
> > than at the SetLatch() calls?
>
> Hmm, I'm not sure I understand what you're suggesting, here. In
> general, we return with the data for the current message unconsumed,
> and then consume it the next time the function is called, so that
> (except when the message wraps the end of the buffer) we can return a
> pointer directly into the buffer rather than having to memcpy(). What
> this patch does is postpone consuming the data further, either until
> we can free up at least a quarter of the ring buffer or until we need
> to wait for more data. It seemed worthwhile to free up space in the
> ring buffer occasionally even if we weren't to the point of waiting
> yet, so that the sender has an opportunity to write new data into that
> space if it happens to still be running.

What I'm trying to suggest is that instead of postponing the update of
mq_bytes_read (by accumulating the amount already read in
mqh_consume_pending), we keep updating mq_bytes_read as we go, but only
set the latch once the threshold above is crossed. That way a burst of
writes can completely fill the ring buffer, but the cost of doing a
SetLatch() is still amortized. In my testing SetLatch() was the
expensive part, not the necessary barriers in shm_mq_inc_bytes_read().
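
Something like the following is what I have in mind (just a sketch;
mqh_unsignaled_bytes and the function name are made up, and the real
update would stay in / around shm_mq_inc_bytes_read()):

static void
shm_mq_note_bytes_consumed(shm_mq_handle *mqh, Size n)
{
    shm_mq     *mq = mqh->mqh_queue;

    /*
     * Advance the shared read pointer right away, so a burst of writes
     * can use the whole ring.  The barrier pairs with the one before
     * the sender's memcpy().
     */
    pg_read_barrier();
    pg_atomic_write_u64(&mq->mq_bytes_read,
                        pg_atomic_read_u64(&mq->mq_bytes_read) + n);

    /* ...but only pay for the expensive SetLatch() now and then. */
    mqh->mqh_unsignaled_bytes += n;
    if (mqh->mqh_unsignaled_bytes > mq->mq_ring_size / 4)
    {
        /* the real code would have to cope with mq_sender being NULL */
        SetLatch(&mq->mq_sender->procLatch);
        mqh->mqh_unsignaled_bytes = 0;
    }
}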

- Andres
