Re: LWLock optimization for multicore Power machines

From: Andres Freund <andres(at)anarazel(dot)de>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: LWLock optimization for multicore Power machines
Date: 2017-04-03 18:56:13
Message-ID: 20170403185613.v452kjchgqtgquos@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-03-31 13:38:31 +0300, Alexander Korotkov wrote:
> > It seems that on this platform the definition of atomics should be provided
> > by fallback.h. But it isn't, because I already defined
> > PG_HAVE_ATOMIC_U32_SUPPORT in arch-ppc.h. I think in this case we shouldn't
> > provide a ppc-specific implementation of pg_atomic_fetch_mask_add_u32().
> > However, I don't know how to do that, given that arch-ppc.h is included
> > before the compiler-specific headers: in arch-ppc.h we don't yet know
> > whether an implementation of atomics will be found for this platform. One
> > possible solution is to provide an assembly implementation for all atomics
> > in arch-ppc.h.

That'd probably not be good use of effort.

> BTW, implementing all the atomics in arch-ppc.h would be too invasive and
> shouldn't be considered for v10.
> However, I made the following workaround: declare pg_atomic_uint32 and
> pg_atomic_fetch_mask_add_u32_impl() only when we know that generic-gcc.h
> will declare gcc-based atomics.
> Could you please check it on Apple PPC?

That doesn't sound crazy to me.

> +/*
> + * pg_atomic_fetch_mask_add_u32 - atomically check that masked bits in variable
> + * and if they are clear then add to variable.

Hm, expanding on this wouldn't hurt.

> + * Returns the value of ptr before the atomic operation.

Sounds a bit like it'd return the pointer itself, not what it points to...
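
Maybe something like this instead (just a sketch, wording up for grabs):

/*
 * pg_atomic_fetch_mask_add_u32 - atomically check whether any of the bits in
 * 'mask' are set in the value 'ptr' points to; if none are set, add
 * 'increment' to that value.
 *
 * Returns the value *ptr held before the operation, whether or not the
 * addition was performed.
 */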

> + * Full barrier semantics.

Does it actually have full barrier semantics? Based on a quick skim I
don't think the ppc implementation does.

> +#if defined(HAVE_ATOMICS) \
> + && (defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS))
> +
> +/*
> + * Declare pg_atomic_uint32 structure before generic-gcc.h does it in order to
> + * use it in function arguments.
> + */

If we do this, then the other side (generic-gcc.h) needs a comment pointing
back here, so the two definitions are kept in sync.

> +#define PG_HAVE_ATOMIC_U32_SUPPORT
> +typedef struct pg_atomic_uint32
> +{
> + volatile uint32 value;
> +} pg_atomic_uint32;
> +
> +/*
> + * Optimized implementation of pg_atomic_fetch_mask_add_u32() for Power
> + * processors. Atomic operations on Power are implemented using optimistic
> + * locking: the 'lwarx' instruction acquires a reservation, but that
> + * reservation can be broken before the matching 'stwcx.', in which case we
> + * have to retry. Thus each CAS operation is itself a loop, and a loop of
> + * CAS operations becomes a two-level nested loop. Experiments on multicore
> + * Power machines show a large benefit from doing this operation in a
> + * single loop in assembly.
> + */
> +#define PG_HAVE_ATOMIC_FETCH_MASK_ADD_U32
> +static inline uint32
> +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> + uint32 mask, uint32 increment)
> +{
> + uint32 result,
> + tmp;
> +
> + __asm__ __volatile__(
> + /* read *ptr and reserve index */
> +#ifdef USE_PPC_LWARX_MUTEX_HINT
> + " lwarx %0,0,%5,1 \n"
> +#else
> + " lwarx %0,0,%5 \n"
> +#endif

Hm, we should remove that hint bit stuff at some point...

> + " and %1,%0,%3 \n" /* calculate '*ptr & mask" */
> + " cmpwi %1,0 \n" /* compare '*ptr & mark' vs 0 */
> + " bne- $+16 \n" /* exit on '*ptr & mark != 0' */
> + " add %1,%0,%4 \n" /* calculate '*ptr + increment' */
> + " stwcx. %1,0,%5 \n" /* try to store '*ptr + increment' into *ptr */
> + " bne- $-24 \n" /* retry if index reservation is broken */
> +#ifdef USE_PPC_LWSYNC
> + " lwsync \n"
> +#else
> + " isync \n"
> +#endif
> + : "=&r"(result), "=&r"(tmp), "+m"(*ptr)
> + : "r"(mask), "r"(increment), "r"(ptr)
> + : "memory", "cc");
> + return result;
> +}

I'm not sure that the barrier semantics here are sufficient.
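
For comparison, a full-barrier read-modify-write on Power is normally
bracketed by a leading sync in addition to the trailing isync -- a trailing
isync/lwsync alone only gives something like acquire semantics. Roughly (an
untested sketch, with the raw branch offsets replaced by local labels):

    __asm__ __volatile__(
        "   sync                \n" /* full fence before the RMW */
        "1: lwarx   %0,0,%5     \n" /* read *ptr and reserve index */
        "   and     %1,%0,%3    \n" /* tmp = old & mask */
        "   cmpwi   %1,0        \n" /* any masked bits set? */
        "   bne-    2f          \n" /* yes: give up, return old value */
        "   add     %1,%0,%4    \n" /* tmp = old + increment */
        "   stwcx.  %1,0,%5     \n" /* store iff reservation still held */
        "   bne-    1b          \n" /* reservation broken: retry */
        "2: isync               \n" /* no speculation past the RMW */
        : "=&r"(result), "=&r"(tmp), "+m"(*ptr)
        : "r"(mask), "r"(increment), "r"(ptr)
        : "memory", "cc");

If only acquire semantics are intended, the comment should promise less
instead.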

> +/*
> + * Generic implementation of pg_atomic_fetch_mask_add_u32() via a loop of
> + * compare & exchange.
> + */
> +static inline uint32
> +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> + uint32 mask_, uint32 add_)
> +{
> + uint32 old_value;
> +
> + /*
> + * Read once outside the loop, later iterations will get the newer value
> + * via compare & exchange.
> + */
> + old_value = pg_atomic_read_u32_impl(ptr);
> +
> + /* loop until we've determined whether we could make an increment or not */
> + while (true)
> + {
> + uint32 desired_value;
> + bool free;
> +
> + desired_value = old_value;
> + free = (old_value & mask_) == 0;
> + if (free)
> + desired_value += add_;
> +
> + /*
> + * Attempt to swap in the value we are expecting. If we didn't see the
> + * masked bits as clear, that's just the old value. If we saw them as
> + * clear, we'll attempt to make the increment. The reason we always
> + * swap in a value is that this doubles as a memory barrier. We could
> + * try to be smarter and only swap when we saw the masked bits as
> + * clear, but benchmarks haven't shown that to be beneficial so far.
> + *
> + * Retry if the value changed since we last looked at it.
> + */
> + if (pg_atomic_compare_exchange_u32_impl(ptr, &old_value, desired_value))
> + return old_value;
> + }
> + pg_unreachable();
> +}
> +#endif
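
FWIW, to make the intended use concrete -- a hypothetical sketch reusing the
existing lwlock.c naming, not code from the patch -- taking a shared LWLock
boils down to "add one shared holder unless the exclusive bit is set", which
is exactly this primitive:

    uint32 old_state;

    /* fail if held exclusively, otherwise count us as a shared holder */
    old_state = pg_atomic_fetch_mask_add_u32(&lock->state,
                                             LW_VAL_EXCLUSIVE, LW_VAL_SHARED);
    if ((old_state & LW_VAL_EXCLUSIVE) == 0)
    {
        /* got the shared lock */
    }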

Have you done x86 benchmarking?

Greetings,

Andres Freund
