Re: Fix performance of generic atomics

From: Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fix performance of generic atomics
Date: 2017-05-25 13:39:22
Message-ID: 9fccff0670a2ec3c031d459564892f42@postgrespro.ru
Lists: pgsql-hackers

A slightly cleaner version of the patch.

Sokolov Yura wrote on 2017-05-25 15:22:
> Good day, everyone.
>
> I've been playing with pgbench on a huge machine
> (72 cores, 56 for postgresql, enough memory to fit the database
> in both shared_buffers and the file cache;
> pgbench scale 500, unlogged tables, fsync=off,
> synchronous_commit=off, wal_writer_flush_after=0).
>
> With 200 clients, performance is around 76000 tps, and the main
> bottleneck in this dumb test is LWLockWaitListLock.
>
> I added a gcc-specific implementation of pg_atomic_fetch_or_u32_impl
> (i.e. using __sync_fetch_and_or), and performance rose to 83000 tps.
>
> That looked a bit strange at first, because __sync_fetch_and_or
> compiles to almost the same CAS loop.
>
> Looking closely, I noticed that the intrinsic does not read the
> value in the loop body, but only at loop initialization. That is
> correct, because on failure the `lock cmpxchg` instruction leaves
> the current value in the EAX register.
>
> It is expected behavior, and pg_atomic_compare_exchange_*_impl does
> the same in all implementations. So there is no need to re-read
> the value in the loop body:
>
> Example diff for pg_atomic_exchange_u32_impl:
>
>  static inline uint32
>  pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
>  {
>      uint32 old;
> +    old = pg_atomic_read_u32_impl(ptr);
>      while (true)
>      {
> -        old = pg_atomic_read_u32_impl(ptr);
>          if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
>              break;
>      }
>      return old;
>  }
>
> After applying this change to all generic atomic functions
> (and to pg_atomic_fetch_or_u32_impl), performance became
> equal to that of the __sync_fetch_and_or intrinsic.
>
> The attached patch contains this change for all generic atomic
> functions, and also adds __sync_fetch_and_(or|and) for gcc, because
> I believe GCC optimizes code around intrinsics better than
> around inline assembler.
> (Final performance is around 86000 tps, but the difference between
> 83000 tps and 86000 tps is not so obvious on a NUMA system.)
>
> With regards,
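
For readers who want to try the loop shape outside the PostgreSQL tree,
here is a minimal standalone sketch. It is not the attached patch: it
assumes C11 <stdatomic.h> in place of the pg_atomic_* layer, and every
sketch_* name is hypothetical. It relies only on the property discussed
above: a failed compare-exchange writes the current value back into the
"expected" argument, so a single read before the loop is enough.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for pg_atomic_compare_exchange_u32_impl.
 * On failure, *expected is overwritten with the value currently in *ptr.
 */
static inline bool
sketch_compare_exchange_u32(_Atomic uint32_t *ptr, uint32_t *expected,
                            uint32_t newval)
{
    return atomic_compare_exchange_strong(ptr, expected, newval);
}

/* Exchange: read once before the loop, never inside it. */
static inline uint32_t
sketch_exchange_u32(_Atomic uint32_t *ptr, uint32_t xchg_)
{
    uint32_t old = atomic_load(ptr);

    /* A failed CAS has already refreshed 'old', so just retry. */
    while (!sketch_compare_exchange_u32(ptr, &old, xchg_))
        ;
    return old;
}

/* Fetch-or with the same loop shape; 'old | or_' is recomputed per retry. */
static inline uint32_t
sketch_fetch_or_u32(_Atomic uint32_t *ptr, uint32_t or_)
{
    uint32_t old = atomic_load(ptr);

    while (!sketch_compare_exchange_u32(ptr, &old, old | or_))
        ;
    return old;
}

/* The GCC/Clang intrinsic variant mentioned above, on a plain uint32_t. */
static inline uint32_t
sketch_fetch_or_u32_intrinsic(volatile uint32_t *ptr, uint32_t or_)
{
    return __sync_fetch_and_or(ptr, or_);
}
```

All three functions return the value seen before the update, so a caller
can use either fetch-or variant interchangeably. Whether the two compile
to the same instructions depends on the compiler and target.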

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

Attachment Content-Type Size
0001-Fix-performance-of-Atomics-generic-implementation.patch text/x-diff 5.7 KB
