What is the real cost of LOCK on an x86 multiprocessor machine?

Post by Mirek Fidl » Sat, 30 Jul 2005 22:54:06


I am unable to find an answer to the above question. The only tidbit of
information I found was that it is 100 cycles on a P4, but perhaps that
was just the worst case.

Measuring the same on my uniprocessor AMD64 machine, a LOCKed instruction
seems to be about 3 times slower than a regular one, but I guess that has
only little relevance to a real MP machine.

Mirek

Post by chris noon » Sun, 31 Jul 2005 00:48:46


Surprisingly, on Intel Pentium processors the overhead of a locked
instruction is about the same for a single processor as for one in a
multi-processor configuration.

The way Microsoft (for example) avoids the performance hit is to replace
the LOCK prefixes in uniprocessor builds of their operating system
binaries with NOP instructions.

Chris


Post by Mirek Fidl » Sun, 31 Jul 2005 01:36:31

>> Measuring the same on my uniprocessor AMD64 machine, a LOCKed instruction

Hm, can you give me some numbers? Is it closer to my measurement (3
times slower than a non-atomic one) or to the 100 cycles I have seen somewhere?


I stepped through InterlockedIncrement and there definitely IS a LOCK prefix.

Mirek

Post by Chris Thom » Sun, 31 Jul 2005 06:31:28

>> Surprisingly, on Intel Pentium processors the overhead of

There are too many variables to consider for any definitive answer. For
example, the negative effects of asserting a CPU's #LOCK signal can be
tremendous. It also tends to give different timing results on multi-CPU
systems because the #LOCK signal locks the entire system bus, so
subsequent requests from any other CPU are blocked. You can't really know
how many operations were blocked during a #LOCK assertion. Sometimes the
assertion can be avoided if the shared data being accessed is already
cached in the CPU.

This variable behavior can lead to inconsistent timing results.

--
http://www.yqcomputer.com/
(portable lock-free data-structures)

Post by Sean Kell » Sun, 31 Jul 2005 06:50:31

For what it's worth, the IA-32 Developer's Manual has this to say about
LOCK:

"Beginning with the P6 family processors, when the LOCK prefix is prefixed
to an instruction and the memory area being accessed is cached internally
in the processor, the LOCK# signal is generally not asserted. Instead, only
the processor's cache is locked. Here, the processor's cache coherency
mechanism insures that the operation is carried out atomically with
regards to memory."

Post by Oliver S » Sun, 31 Jul 2005 09:57:15

> I stepped through InterlockedIncrement and there definitely IS lock.

And there are also locks in EnterCriticalSection and LeaveCriticalSection.
MS dropped the two separate versions of these functions in kernel32.dll
between NT4 and Win2K.

Post by David Schw » Sun, 31 Jul 2005 13:17:09


On a P4, it's 100-200 clocks.

DS

Post by Mirek Fidl » Sun, 31 Jul 2005 18:18:34


Well, I guess I should have been more specific:

If the accessed memory is held exclusively in one CPU's cache, is it true
that the LOCK penalty is lower than 100-200 cycles?

My question is mostly about reference-counted shared objects (e.g. in
C++). What I want to find out is whether the penalty is present in all
cases, or whether it is lower when only a single thread "owns" the shared
data (i.e. the reference count is not accessed by other threads).

Unfortunately, at the moment I do not have hardware to test...

Mirek

Post by David Schw » Sun, 31 Jul 2005 19:21:57


No.


It is.


Sadly, the penalty comes from the fact that the lock must take place
during the execution of a single instruction. This means that no other
operations can overlap that instruction, so the pipelines all empty, the
instruction is processed, then the pipelines fill again. This has a huge
cost on the P4.

DS

Post by Joe Seig » Sun, 31 Jul 2005 20:18:25


It doesn't serialize the processor; there are separate instructions that do
that. It "serializes" the memory accesses, i.e. they complete rather than
just get ordered. Instruction prefetching can still occur, so the pipeline
isn't completely flushed. Self-modifying code needs to use additional
synchronization.

"Locked operations are atomic with respect to all other memory operations and all externally
visible events. Only instruction fetch and page table accesses can pass locked instructions.
Locked instructions can be used to synchronize data written by one processor and read by
another processor.
For the P6 family processors, locked operations serialize all outstanding load and store operations
(that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon
processors, with one exception: load operations that reference weakly ordered memory types
(such as the WC memory type) may not be serialized."

I'm not sure what that last bit means. Interlocked instructions aren't
guaranteed to order loads?


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Post by David Schw » Sun, 31 Jul 2005 20:19:19


It is not guaranteed to serialize the processor; however, my experience
has been that in fact it does. Locked operations could be a lot more
optimized than they are; it just doesn't seem to have been a priority.

DS

Post by Joe Seig » Sun, 31 Jul 2005 20:28:38


You can measure cache hits. Just sequentially access more memory than can
fit in your cache. Brute force will work if you don't want to figure out
the n-way associative hashing and stuff.

You'll have to figure out the contention factor, which depends on the
number of CPUs. If it's a problem, there are more cache-friendly algorithms
out there, though it would be much better if the cache coherency protocols
weren't stuck back in the '70s and '80s as far as threaded programming is
concerned.

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Post by Joe Seig » Sun, 31 Jul 2005 20:52:52


Making the memory accesses complete degrades much of the pipeline even
if true serialization doesn't occur. Memory ordering would have been
much better, though it still would have been a full memory barrier.
Serialization, in my experience, was used on mainframes to enforce hardware
error boundaries on things like context switching. In theory you could
use it for things like IPC without context switching (kernel calls), but
I've not seen any actual implementations myself. You'd need instructions
with cross-address-space support.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Post by Peter Dim » Sun, 31 Jul 2005 21:23:54


In practice, the actual amortized cost appears to be much lower than
50-100x. I see a 4x difference on shared_ptr_timing_test (which
basically tests reference count increments and decrements in a somewhat
real-life worst case scenario.)

A P4 can stall for 50 cycles without an apparent reason. ;-)

Post by chris noon » Sun, 31 Jul 2005 21:51:26


I vaguely remember encountering the LOCK prefix in one or two places in
Microsoft OS code disassembly (on a uniprocessor installation). I assumed
it was simply a mistake: something like new routines being hacked in
quickly without anyone taking the trouble to add the offsets of the LOCK
prefixes to the list to be NOP'ed out.

Most of my work is done with NT4 SP4. If subsequent
versions retain the LOCK prefix in uniprocessor
mode that is interesting. Obviously Microsoft got
a measurable performance boost by removing the LOCKs
(I do too, in my code) or they wouldn't have taken
the trouble. By the way, the routines I studied most
were heap functions: NtAllocateHeap etc.


Chris