Re: update i386 spinlock for hyperthreading

From: Manfred Spraul <manfred(at)colorfullife(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: update i386 spinlock for hyperthreading
Date: 2003-12-27 10:34:16
Message-ID: 3FED6028.7020300@colorfullife.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-hackers pgsql-patches

Tom Lane wrote:

>Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>Don't you have to put it in a specific place in the loop to make that
>>>work? If not, why not?
>>>
>>>
>>>
>>Rep;nop is just a short delay - that's all.
>>
>>
>
>That view seems to me to be directly contradicted by this statement:
>
>
>
>>The PAUSE instruction provides a hint to the processor that
>>the code sequence is a spin-wait loop. The processor uses this hint to
>>avoid the memory order violation in most situations, which greatly
>>improves processor performance.
>>
>>
>
>It's not apparent to me how a short delay translates into avoiding a
>memory order violation (possibly some docs on what that means exactly
>might help...). I suspect strongly that there needs to be some near
>proximity between the PAUSE instruction and the lock-test instruction
>for this to work as advertised. It would help if Intel were less coy
>about what the instruction really does.
>
>
My guess: Pentium 4 cpu support something like 250 uops in flight - it
will have a dozend of the spinlock loops in it's pipeline. When the
spinlock is released, it must figure out which of the loops should get
it, and gets lost. My guess is that rep;nop delays the cpu buy at least
100 cpu ticks, and thus the pipeline will be empty before it proceeds. I
don't have a Pentium 4, and the HP testdrive is down. Someone around who
could run my test app?

>
>
>>This instruction does not change the architectural state of the
>>processor (that is, it performs essentially a delaying noop
>>operation).
>>
>>
>
>This can be rephrased as "we're not telling you what this instruction
>really does, because its interesting effects are below the level of the
>instruction set architecture". Great. How are we supposed to know
>how to use it?
>
>
There was a w_spinlock.pdf document with reference code. google still
finds it, but the links are dead :-(

>>I think a separate function is better than adding it into TAS: if it's
>>part of tas, then it would automatically be included by every
>>SpinLockAcquire call - unnecessary .text bloat.
>>
>>
>
>Why do you think it's unnecessary? One thing that I find particularly
>vague in the quoted documentation is the statement that the PAUSE
>instruction is needed to avoid a delay when *exiting* the spin-wait
>loop. Doesn't this mean that a PAUSE is needed in the success path
>when the first TAS succeeds (i.e, the normal no-contention path)?
>
IIRC: No.

>If not, why not? If so, does it go before or after the lock
>instruction?
>
>
Neither: somewhere in the failure path.

>Also, if the principal effect is a "short delay", do we really need it
>at all considering that our inner loop in s_lock is rather more than
>an "xchgb" followed by a conditional branch? There will be time for
>the write queue to drain while we're incrementing and testing our
>spin counter (which I trust is in a register...).
>
>The reason I'm so full of questions is that I spent some time several
>days ago looking at exactly this issue, and came away with only the
>conclusion that I had to find some more-detailed documentation before
>I could figure out what we should do about the spinlocks for Xeons.
>
I'll try to find some more docs and post links.

The 2nd thing I would change is to add a nonatomic test in the slow
path: locked instructions generate lots of bus traffic, and that's a
waste of resources.

Another question: regardless of the placement of rep;nop - 10% speedup
means that the postgres spends far too much time in the spinlock code.
I've looked at the oprofile dumps, and something like 1.2% of the total
cpu time is spent it the TAS macro in LWLockAcquire. That's the hottest
instruction in the whole profile, it eats more cpu cycles than the
memcpy() calls that transfer data to/from kernel.
Is there an easy way find out which LWLock is contended?
--
Manfred

Attachment Content-Type Size
rep_nop.cpp text/x-c++src 3.6 KB
patch-spinlock-i386 text/plain 2.4 KB

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message ohp 2003-12-27 10:46:01 Re: update i386 spinlock for hyperthreading
Previous Message Joe Conway 2003-12-27 06:44:54 Re: Strange permission problem regarding pg_settings

Browse pgsql-hackers by date

  From Date Subject
Next Message ohp 2003-12-27 10:46:01 Re: update i386 spinlock for hyperthreading
Previous Message Joe Conway 2003-12-27 06:44:54 Re: Strange permission problem regarding pg_settings

Browse pgsql-patches by date

  From Date Subject
Next Message ohp 2003-12-27 10:46:01 Re: update i386 spinlock for hyperthreading
Previous Message Claudio Natoli 2003-12-27 10:27:27 Re: fork/exec patch: pre-CreateProcess finalization