SyntaxHighlighter

Tuesday, May 26, 2015

Volatile and memory barriers

I have already blogged on the effect of a volatile variable on optimization performed by the JIT. But what is the really difference between a regular variable ? And what are the impacts in terms of performance ?

Semantic


Volatile has a well defined semantic in Java Memory Model (JMM), but to summarize, it has the following:
  • Cannot be reordered
  • Ensure data visibility to other threads

Visibility


Doug Lea refers thread visibility to flushing caches, however, as pointed out by Martin Thompson in his post, volatile access does not flush the cache for visibility (as in writing data to memory to make it visible for all cores).
ccNUMA architecture means that all data in cache subsystem are in fact coherent with main memory. So the semantic of volatile apply through the Load/Store buffers (or Memory Ordering Buffers) placed between registers and L1 cache.

Depending on CPU architecture/instructions set, generated instructions to ensure those properties can vary. Let's focus on x86 which is the most wide spread.

Reordering


For reordering there is 2 kinds:
  • Compiler
  • Hardware/CPU
Compiler is able to reorder instructions during instruction scheduling to match cost of loading or storing data with the CPU specifications.
It could be interesting to trigger 2 loads without dependencies between each other in order to optimize the time spent by the CPU to wait for those data and perform other operations in the mean time.

In the following example I have put 6 non-volatile fields :

public class TestJIT
{
    private static int field1;
    private static int field2;
    private static int field3;
    private static int field4;
    private static int field5;
    private static int field6;
    
    private static void assign(int i)
    {
        field1 = i << 1;
        field2 = i << 2;
        field3 = i << 3;
        field4 = i << 4;
        field5 = i << 5;
        field6 = i << 6;
    }

    public static void main(String[] args) throws Exception
    {
        for (int i = 0; i < 10000; i++)
        {
            assign(i);
        }
        Thread.sleep(1000);
    }
}


Let's examine what is generated by the JIT for the method assign:

  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x02438800: push   ebp
  0x02438801: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::assign@-1 (line 26)
  0x02438807: mov    ebx,ecx
  0x02438809: shl    ebx,1
  0x0243880b: mov    edx,ecx
  0x0243880d: shl    edx,0x2
  0x02438810: mov    eax,ecx
  0x02438812: shl    eax,0x3
  0x02438815: mov    esi,ecx
  0x02438817: shl    esi,0x4
  0x0243881a: mov    ebp,ecx
  0x0243881c: shl    ebp,0x5
  0x0243881f: shl    ecx,0x6
  0x02438822: mov    edi,0x160
  0x02438827: mov    DWORD PTR [edi+0x565c3b0],ebp
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@27 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x0243882d: mov    ebp,0x164
  0x02438832: mov    DWORD PTR [ebp+0x565c3b0],ecx
                                        ;*putstatic field6
                                        ; - com.bempel.sandbox.TestJIT::assign@34 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438838: mov    ebp,0x150
  0x0243883d: mov    DWORD PTR [ebp+0x565c3b0],ebx
                                        ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438843: mov    ecx,0x154
  0x02438848: mov    DWORD PTR [ecx+0x565c3b0],edx
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@9 (line 27)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x0243884e: mov    ebx,0x158
  0x02438853: mov    DWORD PTR [ebx+0x565c3b0],eax
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@15 (line 28)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438859: mov    ecx,0x15c
  0x0243885e: mov    DWORD PTR [ecx+0x565c3b0],esi
                                        ;*putstatic field4
                                        ; - com.bempel.sandbox.TestJIT::assign@21 (line 29)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438864: add    esp,0x8
  0x02438867: pop    ebp
  0x02438868: test   DWORD PTR ds:0x190000,eax
                                        ;   {poll_return}
  0x0243886e: ret    


As you can see in the comments, the order of field assignation is the following: field5, field6, field1, field2, field3.
We now add volatile modifier for field1 & field6:
  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x024c8800: push   ebp
  0x024c8801: sub    esp,0x8
  0x024c8807: mov    ebp,ecx
  0x024c8809: shl    ebp,1
  0x024c880b: mov    edx,ecx
  0x024c880d: shl    edx,0x2
  0x024c8810: mov    esi,ecx
  0x024c8812: shl    esi,0x3
  0x024c8815: mov    eax,ecx
  0x024c8817: shl    eax,0x4
  0x024c881a: mov    ebx,ecx
  0x024c881c: shl    ebx,0x5
  0x024c881f: shl    ecx,0x6
  0x024c8822: mov    edi,0x150
  0x024c8827: mov    DWORD PTR [edi+0x562c3b0],ebp
                                        ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c882d: mov    ebp,0x160
  0x024c8832: mov    DWORD PTR [ebp+0x562c3b0],ebx
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@27 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8838: mov    ebx,0x154
  0x024c883d: mov    DWORD PTR [ebx+0x562c3b0],edx
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@9 (line 27)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8843: mov    ebp,0x158
  0x024c8848: mov    DWORD PTR [ebp+0x562c3b0],esi
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@15 (line 28)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c884e: mov    ebx,0x15c
  0x024c8853: mov    DWORD PTR [ebx+0x562c3b0],eax
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8859: mov    ebp,0x164
  0x024c885e: mov    DWORD PTR [ebp+0x562c3b0],ecx
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8864: lock add DWORD PTR [esp],0x0  ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
  0x024c8869: add    esp,0x8
  0x024c886c: pop    ebp
  0x024c886d: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x024c8873: ret    

Now, field1 is really the first field assigned and field 6 is last one. But in the middle we have the following order: field5, field2, field3, field4.

Beside that, CPU is able to reorder the instruction flow in certain circumstances to also optimize the efficiency of instruction execution. Those properties are well summarized in the Intel white paper Memory Ordering.

Write access

From previous example which is only volatile writes you can notice there is a special instruction: lock add. This is a bit weird at first, we add 0 to the memory pointed by SP (Stack Pointer) register. So we are not changing the data at this memory location but with lock prefix, memory related instructions are processed specially to act as a memory barrier, similar to mfence instruction. Dave Dice explains in his blog that they benchmarked the 2 kinds of barrier, and the lock add seems the most efficient one on today's architecture.

So this barrier ensures that there is no reordering before and after this instruction and also drains all instructions pending into Store Buffer. After executing these instructions, all writes are visible to all other threads through cache subsystem or main memory. This costs some latency to wait for this drain.

LazySet

Some time we can relax the constraint on immediate visibility but still keep the ordering guarantee. For this reason, Doug Lea introduces lazySet method for Atomic* objects. Let's use AtomicInteger in replacement of volatile int:

public class TestJIT
{
    private static AtomicInteger field1 = new AtomicInteger(0);
    private static int field2;
    private static int field3;
    private static int field4;
    private static int field5;
    private static AtomicInteger field6 = new AtomicInteger(0);
    
    public static void assign(int i)
    {
        field1.lazySet(i << 1);
        field2 = i << 2;
        field3 = i << 3;        
        field4 = i << 4;
        field5 = i << 5;
        field6.lazySet(i << 6);
    }
    
    public static void main(String[] args) throws Throwable
    {
        for (int i = 0; i < 10000; i++)
        {
            assign(i);
        }
        Thread.sleep(1000);
    }
}

We have the following output for PrintAssembly:

  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # this:     ecx       = 'com/bempel/sandbox/TestJIT'
  # parm0:    edx       = int
  #           [sp+0x10]  (sp of caller)
  0x024c7f40: cmp    eax,DWORD PTR [ecx+0x4]
  0x024c7f43: jne    0x024ace40         ;   {runtime_call}
  0x024c7f49: xchg   ax,ax
[Verified Entry Point]
  0x024c7f4c: mov    DWORD PTR [esp-0x3000],eax
  0x024c7f53: push   ebp
  0x024c7f54: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::assign@-1 (line 30)
  0x024c7f5a: mov    ebp,edx
  0x024c7f5c: shl    ebp,1              ;*ishl
                                        ; - com.bempel.sandbox.TestJIT::assign@5 (line 30)
  0x024c7f5e: mov    ebx,0x150
  0x024c7f63: mov    ecx,DWORD PTR [ebx+0x56ec4b8]
                                        ;*getstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@0 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f69: test   ecx,ecx
  0x024c7f6b: je     0x024c7fd0
  0x024c7f6d: mov    DWORD PTR [ecx+0x8],ebp  ;*invokevirtual putOrderedInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 80)
                                        ; - com.bempel.sandbox.TestJIT::assign@6 (line 30)
  0x024c7f70: mov    edi,edx
  0x024c7f72: shl    edi,0x2
  0x024c7f75: mov    ebp,edx
  0x024c7f77: shl    ebp,0x3
  0x024c7f7a: mov    esi,edx
  0x024c7f7c: shl    esi,0x4
  0x024c7f7f: mov    eax,edx
  0x024c7f81: shl    eax,0x5
  0x024c7f84: shl    edx,0x6            ;*ishl
                                        ; - com.bempel.sandbox.TestJIT::assign@39 (line 35)
  0x024c7f87: mov    ebx,0x154
  0x024c7f8c: mov    ebx,DWORD PTR [ebx+0x56ec4b8]
                                        ;*getstatic field6
                                        ; - com.bempel.sandbox.TestJIT::assign@33 (line 35)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f92: mov    ecx,0x158
  0x024c7f97: mov    DWORD PTR [ecx+0x56ec4b8],edi
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@12 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f9d: mov    ecx,0x164
  0x024c7fa2: mov    DWORD PTR [ecx+0x56ec4b8],eax
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@30 (line 34)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fa8: mov    edi,0x15c
  0x024c7fad: mov    DWORD PTR [edi+0x56ec4b8],ebp
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@18 (line 32)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fb3: mov    ecx,0x160
  0x024c7fb8: mov    DWORD PTR [ecx+0x56ec4b8],esi
                                        ;*putstatic field4
                                        ; - com.bempel.sandbox.TestJIT::assign@24 (line 33)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fbe: test   ebx,ebx
  0x024c7fc0: je     0x024c7fdd
  0x024c7fc2: mov    DWORD PTR [ebx+0x8],edx  ;*invokevirtual putOrderedInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 80)
                                        ; - com.bempel.sandbox.TestJIT::assign@40 (line 35)
  0x024c7fc5: add    esp,0x8
  0x024c7fc8: pop    ebp
  0x024c7fc9: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x024c7fcf: ret    

So now no trace of memory barriers whatsoever: (no mfence, no lock add instruction). But we have the order of field 1 and field 6 remain (first and last) then field2, field5, field3 and field4.
In fact, lazySet method call putOrderedInt from Unsafe object which do not emit memory barrier but guarantee no reordering.

Read Access

We will now examine what is the cost of a volatile read with this example:

public class TestJIT
{
    private static volatile int field1 = 42;
    
    public static void testField1(int i)
    {
        if (field1 < 0)
        {
            System.out.println("field value: " + field1);
        }
    }
    
    public static void main(String[] args) throws Throwable
    {
        for (int i = 0; i < 10000; i++)
        {
            testField1(i);
        }
        Thread.sleep(1000);
    }
}

The PrintAssembly output looks like:

  # {method} 'testField1' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x024f8800: mov    DWORD PTR [esp-0x3000],eax
  0x024f8807: push   ebp
  0x024f8808: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::testField1@-1 (line 22)
  0x024f880e: mov    ebx,0x150
  0x024f8813: mov    ecx,DWORD PTR [ebx+0x571c418]
                                        ;*getstatic field1
                                        ; - com.bempel.sandbox.TestJIT::testField1@0 (line 22)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024f8819: test   ecx,ecx
  0x024f881b: jl     0x024f8828         ;*ifge
                                        ; - com.bempel.sandbox.TestJIT::testField1@3 (line 22)
  0x024f881d: add    esp,0x8
  0x024f8820: pop    ebp
  0x024f8821: test   DWORD PTR ds:0x190000,eax
                                        ;   {poll_return}
  0x024f8827: ret    

No memory barrier. Only a load from memory address to a register. This is what is required for a volatile read. JIT cannot optimize to cache it into a register. But with cache subsystem, latency for volatile read is very similar compared to regular variable read. If you test the same Java code without volatile modifier the result for this test is in fact the same.
Volatile reads put also some constraint on reordering.

Summary

Volatile access prevents instructions reordering either at compiler level or at hardware level.
It ensures visibility, for write with the price of memory barriers, for read with no register caching or reordering possibilities.

References


Saturday, May 23, 2015

Measuring contention on locks

Locks is one of the major bottleneck in scalability for your code. Lock-Free structures offer an alternative to your lock usage. However it is sometimes more difficult to use or to integrate into your existing code. Before rushing into this specific approach, it would be interesting to determine which part of your locking code would really benefit from this. Locks become a bottleneck if more than one thread tries to acquire them and needs to wait for a release. This is contention. Measuring this contention will help us to pinpoint which locks need to be improved.

You see in this Java world there are two kinds of locks. Those with synchronized blocks, and those which use java.util.concurrent.Lock. The first ones are directly handled by the JVM with the help of a specific byte code (monitorenter & monitorexit). With those, JVM provides, via JVMTI, some events that can be used by native agents to get information about synchronized blocks: MonitorContendedEnter & MonitorContentedEntered.
Profilers like YourKit exploit this information to provide contention profiling.



Azul Systems provides with their Zing JVM a tool named ZVision to also profile the synchronized blocks:

AzulZvision_syncLocks.png

But what about j.u.c.Lock ?

Here, this is trickier: j.u.c.Lock is not handled by the JVM. It is part of the JDK classes like any regular library. No special treatment. Hence, no special information about them.
YourKit is not able to profile them. However, another profiler which is able to profile j.u.c.Lock exists: JProfiler.

JProfiler_SyncLocks.png

JProfiler_jucLocks.png

I suppose it uses instrumentation to insert callbacks when j.u.c.Lock classes are used. Profile information seems precise with callstacks. It helps to identify which locks are effectively in contention.

I have also found an article describing jucProfiler, a tool created by an IBM team.

Before finding those 2 last tools, I made a very light and quick j.u.c.Lock profiler on my own. The technique is simple and I will describe it below. Full code available on GitHub.
The goal was to profile existing code using j.u.c.Locks. The plan wasn't to create any wrapper around them or any subtype, but to intercept the actual classes. I copied the JDK classes and kept them in the same package.
I have identified in the code where the lock fails to be acquired. It means in those cases there is a contention with another thread which already holds the current lock.
Here is an extract from ReentrantLock:

I incremented a contention field to keep track of the fail attempts. No need to have an atomic counter for this. If some contended locks are not precisely reported, it is not a big deal.

Each lock instance is identified at the construction by a unique ID and a callstack, stored into a global static map.

When you want to report the statistics about lock contention, you can traverse the map and print the information about each lock including the number of contentions detected.

To use those classes instead of the ones from the JDK you need to prepend your jar containing them into the bootclasspath. This way, it is your classes that are looked up before those contained into the rt.jar

-Xbootclasspath/p:contention-profiling.jar  

Here is an example of the output of the profiler:

The line creating lock n indicates where the lock id n is created. The line with equals reports you for a locking id (left of the equal), the number of time that lock has contended (right of the equal).
Then it helps you to focus and work on the most contented ones first.

And remember: Measure, Don't Premature!

Thanks Georges & Aurélien for the review.