atomics/arch-x86.h is stupider than atomics/generic-gcc.h?

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: atomics/arch-x86.h is stupider than atomics/generic-gcc.h?
Date: 2017-09-06 02:30:58
Lists: pgsql-hackers

I spent some time trying to devise a suitable performance microbenchmark
for the atomic ops, in pursuit of whether the proposal at
is worth doing. I came up with the attached very simple-minded test
case, which you run with something like

create function my_test_atomic_ops(bigint) returns int
strict volatile language c as '/path/to/';


select my_test_atomic_ops(1000000000);

The performance of a single process running this is interesting, but
only mildly so: what we want to know about is what happens when you
run two or more calls concurrently.
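For readers without the attachment, here is a minimal standalone sketch of the same kind of contended measurement. It is not the attached atomic-perf-test.c: it assumes C11 <stdatomic.h> and a POSIX system instead of PostgreSQL's pg_atomic_* API, and run_contended is a hypothetical name chosen for illustration. It forks some number of processes that all hammer one counter in shared memory, roughly what several concurrent calls of the SQL function would do.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <stdatomic.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the attachment): fork nprocs children that
 * each atomically add 1 to a shared counter iters times, wait for all of
 * them, and return the final counter value.  Returns 0 if mmap fails.
 */
uint32_t
run_contended(int nprocs, int64_t iters)
{
    _Atomic uint32_t *counter;
    uint32_t    result;
    int         started = 0;

    /* Anonymous shared mapping, so children see the same counter. */
    counter = mmap(NULL, sizeof(*counter), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED)
        return 0;
    atomic_init(counter, 0);

    for (int i = 0; i < nprocs; i++)
    {
        pid_t       pid = fork();

        if (pid < 0)
            break;              /* could not fork; run with what we have */
        if (pid == 0)
        {
            /* Child: the contended inner loop, result ignored. */
            for (int64_t n = iters; n > 0; n--)
                (void) atomic_fetch_add(counter, 1);
            _exit(0);
        }
        started++;
    }

    /* Parent: reap every child that was started. */
    while (started-- > 0)
        (void) wait(NULL);

    result = atomic_load(counter);
    munmap((void *) counter, sizeof(*counter));
    return result;
}
```

To turn this into a benchmark, one would wrap the call in clock_gettime(CLOCK_MONOTONIC, ...) and compare nprocs = 1 against nprocs = 2 or more at the same per-process iteration count.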

On my primary server, dual quad-core Xeon E5-2609 @ 2.4GHz, RHEL6
(so gcc version 4.4.7 20120313 (Red Hat 4.4.7-18)), in a disable-cassert
build, I see that a single process running the 1G-iterations case
repeatably takes about 9600ms. Two competing processes take roughly
1 minute to do twice as much work. (The two processes tend to finish
at significantly different times, indicating that this box's method
for resolving bus conflicts isn't 100% fair. I'm taking the average
of the two runtimes as a representative number.)

This is with no source-code changes, meaning that what I'm testing is
arch-x86.h's version of pg_atomic_fetch_add_u32, which compiles to

xaddl %eax,(%rdx)

I then diked out that version, so that the build fell back to
generic-gcc.h's version of the function. With the test program
as attached, the inner loop is basically the same, and so is the
runtime. But what I was testing before that was a version that
ignored the result of pg_atomic_fetch_add_u32,

while (count-- > 0)
    (void) pg_atomic_fetch_add_u32(myptr, 1);

and what I was quite surprised to see was a single-thread time of
9600ms and a two-thread time of ~40s. The reason was not too far
to seek: gcc is smart enough to notice that it doesn't need the
result of pg_atomic_fetch_add_u32, and so it compiles that to just

lock addl $1, (%rax)

which is evidently significantly more efficient than the xaddl under
contention load.
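The two loop shapes being compared can be reproduced outside the tree with something like the sketch below. It uses gcc's __sync_fetch_and_add builtin directly as a stand-in for the generic-gcc.h fallback (an assumption, since the header itself is not quoted in this message), and both function names are made up for illustration.

```c
#include <stdint.h>

/*
 * Result is used: the old value has to be recovered after each add, so on
 * x86 the compiler generally needs an exchange-and-add instruction.
 */
uint32_t
add_and_return_old(volatile uint32_t *ptr, int64_t count)
{
    uint32_t    old = 0;

    while (count-- > 0)
        old = __sync_fetch_and_add(ptr, 1);
    return old;
}

/*
 * Result is discarded: the compiler is free to pick a cheaper instruction
 * that only performs the atomic add.
 */
void
add_and_ignore(volatile uint32_t *ptr, int64_t count)
{
    while (count-- > 0)
        (void) __sync_fetch_and_add(ptr, 1);
}
```

Compiling this with gcc -O2 -S and reading the two inner loops is a quick way to check what a particular compiler version emits for each case; the exact output will depend on the compiler and target.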

Or in words of one syllable: at least for pg_atomic_fetch_add_u32,
we are working hard in atomics/arch-x86.h to get worse code than
gcc would give us natively. (And, in case you didn't notice, this
is far from the latest and shiniest gcc.)
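The likely mechanism is that a hand-written asm block is opaque to the optimizer: gcc cannot see that the instruction inside produces a value nobody reads, so it cannot swap in a cheaper one. The sketch below shows the general shape of such a wrapper. It is an illustration written for this note, not a copy of arch-x86.h; the function name and the exact constraints are assumptions, and non-x86 targets fall back to the gcc builtin so the file still compiles.

```c
#include <stdint.h>

/*
 * Hypothetical asm-based fetch-and-add.  Returns the value *ptr held
 * before the addition.
 */
uint32_t
fetch_add_u32_asm(volatile uint32_t *ptr, int32_t add_)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t    val = (uint32_t) add_;

    /*
     * xadd swaps the register with memory and stores the sum in memory,
     * leaving the old memory value in the register.  The lock prefix makes
     * the read-modify-write atomic.  Because this is asm volatile, the
     * compiler keeps it exactly as written even when the caller throws the
     * return value away.
     */
    __asm__ __volatile__(
        "lock xaddl %0, %1"
        : "+r"(val), "+m"(*ptr)
        :
        : "memory", "cc");
    return val;
#else
    /* Non-x86 fallback so the sketch stays portable. */
    return __sync_fetch_and_add(ptr, add_);
#endif
}
```

With the builtin, by contrast, the compiler knows the semantics of the operation and can choose the instruction based on whether the result is live.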

This case is not to be dismissed as insignificant either, since of the
three non-test occurrences of pg_atomic_fetch_add_u32 in our tree, two
ignore the result.

So I think we'd be well advised to cast a doubtful eye at the asm
constructs we've got here, and figure out which ones are really
meaningfully smarter than gcc's primitives.

regards, tom lane

Attachment Content-Type Size
atomic-perf-test.c text/x-c 925 bytes
