| From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Sokolov Yura <funny(dot)falcon(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: Fix performance of generic atomics | 
| Date: | 2017-09-06 17:17:18 | 
| Message-ID: | 13707.1504718238@sss.pgh.pa.us | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> It's not a question of whether the return value is used, but of
> whether the updated value of *old is used.
Right, but if we re-read "old" in the loop, and if the primitive
doesn't return "old" (or does, but call site ignores it) then in
principle the CAS might be strength-reduced a bit.  Whether that
can win enough to be better than removing the unlocked read,
I dunno.
In the case at hand, looking at a loop like
	while (count-- > 0)
	{
		(void) pg_atomic_fetch_or_u32(myptr, 1);
	}
with our HEAD code and RHEL6's gcc I get this for the inner loop:
.L9:
	movl	(%rdx), %eax
	movl	%eax, %ecx
	orl	$1, %ecx
	lock				
	cmpxchgl	%ecx,(%rdx)	
	setz		%cl		
	testb	%cl, %cl
	je	.L9
	subq	$1, %rbx
	testq	%rbx, %rbx
	jg	.L9
Applying the proposed generic.h patch, it becomes
.L10:
	movl	(%rsi), %eax
.L9:
	movl	%eax, %ecx
	orl	$1, %ecx
	lock				
	cmpxchgl	%ecx,(%rdx)	
	setz		%cl		
	testb	%cl, %cl
	je	.L9
	subq	$1, %rbx
	testq	%rbx, %rbx
	jg	.L10
Note that in both cases the cmpxchgl is coming out of the asm construct in
pg_atomic_compare_exchange_u32_impl from atomics/arch-x86.h, so that even
if a simpler assembly instruction were possible given that we don't need
%eax to be updated, there's no chance of that actually happening.  This
gets back to the point I made in the other thread that maybe the
arch-x86.h asm sequences are not an improvement over the gcc intrinsics.
The code is the same if the loop is modified to use the result of
pg_atomic_fetch_or_u32, so I won't bother showing that.
Adding the proposed generic-gcc.h patch, the loop reduces to
.L11:
	lock orl	$1, (%rax)
	subq	$1, %rbx
	testq	%rbx, %rbx
	jg	.L11
or if we make the loop do
		junk += pg_atomic_fetch_or_u32(myptr, 1);
then we get
.L13:
	movl	(%rsi), %eax
.L10:
	movl	%eax, %edi
	movl	%eax, %ecx
	orl	$1, %ecx
	lock cmpxchgl	%ecx, (%rdx)
	jne	.L10
	addl	%edi, %r8d
	subq	$1, %rbx
	testq	%rbx, %rbx
	jg	.L13
So that last is slightly better than the generic.h + asm CAS version,
perhaps not meaningfully so --- but it's definitely better when
the compiler can see the result isn't used.
In short then, given the facts on the ground here, in particular the
asm-based CAS primitive at the heart of these loops, it's clear that
taking the read out of the loop can't hurt.  But that should be read
as a rather narrow conclusion.  With a different compiler and/or a
different version of pg_atomic_compare_exchange_u32_impl, maybe the
answer would be different.  And of course it's moot once the
generic-gcc.h patch is applied.
Anyway, I don't have a big objection to applying this.  My concern
is more that we need to be taking a harder look at other parts of
the atomics infrastructure, because tweaks there are likely to buy
much more.
regards, tom lane
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2017-09-06 17:27:36 | Re: ALTER INDEX .. SET STATISTICS ... behaviour | 
| Previous Message | Simon Riggs | 2017-09-06 16:59:20 | Re: ALTER INDEX .. SET STATISTICS ... behaviour |