Fix performance of generic atomics

From:	Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Fix performance of generic atomics
Date:	2017-05-25 12:22:03
Message-ID:	7f65886daca545067f82bf2b463b218d@postgrespro.ru
Lists:	pgsql-hackers

Good day, everyone.

I've been playing with pgbench on a huge machine
(72 cores, 56 of them for PostgreSQL, enough memory to fit the database
both in shared_buffers and in the file cache;
pgbench scale 500, unlogged tables, fsync=off,
synchronous_commit=off, wal_writer_flush_after=0).

With 200 clients, performance is around 76000 tps, and the main
bottleneck in this dumb test is LWLockWaitListLock.

I added a gcc-specific implementation of pg_atomic_fetch_or_u32_impl
(i.e. using __sync_fetch_and_or), and performance became 83000 tps.
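
For illustration, a gcc-specific fetch-or of this kind might look roughly like the sketch below. The type demo_atomic_uint32 and the function demo_atomic_fetch_or_u32 are hypothetical stand-ins, not the names used in the patch; only the __sync_fetch_and_or builtin itself is real.

```c
#include <stdint.h>

/* Hypothetical stand-in for pg_atomic_uint32 (assumption, not the real type). */
typedef struct
{
    volatile uint32_t value;
} demo_atomic_uint32;

/*
 * Atomically OR or_ into *ptr and return the value that was stored
 * before the operation. __sync_fetch_and_or is a GCC builtin that
 * acts as a full memory barrier.
 */
static inline uint32_t
demo_atomic_fetch_or_u32(demo_atomic_uint32 *ptr, uint32_t or_)
{
    return __sync_fetch_and_or(&ptr->value, or_);
}
```

The compiler is free to pick the instruction sequence here, which on x86 is typically a `lock cmpxchg` loop when the old value is needed.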

That was a bit strange at first glance, because __sync_fetch_and_or
compiles to almost the same CAS loop.

Looking closely, I noticed that the intrinsic does not read the value
in the loop body, but only once at loop initialization. That is correct,
because on failure the `lock cmpxchg` instruction stores the current
(old) value in the EAX register.
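
A small sketch of that failure behavior, using the GCC builtin __atomic_compare_exchange_n rather than PostgreSQL's own wrappers (demo_compare_exchange_u32 is a hypothetical name chosen for this example): when the comparison fails, the value actually found in memory is written back into *expected.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical wrapper (assumption): try to replace *ptr with newval
 * if it currently equals *expected. On failure, the builtin stores
 * the value it observed into *expected, so the caller does not need
 * a separate read to learn the current value.
 */
static inline bool
demo_compare_exchange_u32(volatile uint32_t *ptr, uint32_t *expected,
                          uint32_t newval)
{
    return __atomic_compare_exchange_n(ptr, expected, newval,
                                       false,            /* strong CAS */
                                       __ATOMIC_SEQ_CST, /* on success */
                                       __ATOMIC_SEQ_CST  /* on failure */);
}
```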

Returning the observed value on failure is expected behavior, and
pg_atomic_compare_exchange_*_impl does the same in all implementations.
So there is no need to re-read the value in the loop body:

Example diff for pg_atomic_exchange_u32_impl:

 static inline uint32
 pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
 {
     uint32 old;
+    old = pg_atomic_read_u32_impl(ptr);
     while (true)
     {
-        old = pg_atomic_read_u32_impl(ptr);
         if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
             break;
     }
     return old;
 }
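
The same pattern can be sketched in standalone form with C11 <stdatomic.h>, as an illustration only. The names demo_exchange_u32 and demo_fetch_or_u32 are hypothetical and this is not the patch itself. The value is read once before the loop, and each failed compare-exchange refreshes old with the current contents.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Exchange built on a CAS loop, with the read hoisted out of the loop. */
static inline uint32_t
demo_exchange_u32(_Atomic uint32_t *ptr, uint32_t xchg_)
{
    uint32_t old = atomic_load(ptr);    /* single initial read */

    /* On failure, old is updated to the value currently in *ptr. */
    while (!atomic_compare_exchange_weak(ptr, &old, xchg_))
        ;
    return old;
}

/* Fetch-or built the same way; old | or_ is recomputed on every retry. */
static inline uint32_t
demo_fetch_or_u32(_Atomic uint32_t *ptr, uint32_t or_)
{
    uint32_t old = atomic_load(ptr);

    while (!atomic_compare_exchange_weak(ptr, &old, old | or_))
        ;
    return old;
}
```

Under contention this saves one load of the contended cache line per retry, which is presumably where the difference in the numbers above comes from.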

After applying this change to all generic atomic functions
(including pg_atomic_fetch_or_u32_impl), performance became
equal to that of the __sync_fetch_and_or intrinsic.

The attached patch contains this change for all generic atomic
functions, and also adds __sync_fetch_and_(or|and) for gcc, because
I believe GCC optimizes code around intrinsics better than
around inline assembler.
(Final performance is around 86000 tps, but the difference between
83000 tps and 86000 tps is not so obvious on a NUMA system.)

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
