Monday, July 13, 2015

Notify... oh, wait! I have a signal.


When you want to pipeline work, delegate a task asynchronously, or simply synchronize two threads, you usually end up using the wait/notify pair (or await/signal, depending on your taste).

But what is the cost, or the overhead, of this kind of pattern?

Under the hood

What happens when we use the wait/notify pair?
I restrict the discussion to this pair, as the other one (await/signal) calls the same set of underlying methods.


Basically, we are performing system calls. For Object.wait, we ask the OS scheduler to move the current thread to the wait queue.

For Object.notify, we ask the scheduler (via futexes[1] on Linux) to move one of the waiting threads from the wait queue to the run queue to be scheduled when possible.
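To make the two pairs concrete, here is a minimal sketch of the same one-shot handoff written both ways. The class name Handoff and its method names are hypothetical, chosen only for illustration. Both variants guard the wait with a flag, so a spurious wakeup or a notification sent before the waiter arrives is handled correctly.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration: one handoff, written with both signaling pairs.
final class Handoff {
    // Variant 1: intrinsic monitor with Object.wait / Object.notify.
    private final Object monitor = new Object();
    private boolean monitorReady; // guarded by monitor

    // Variant 2: explicit lock with Condition.await / Condition.signal.
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition ready = lock.newCondition();
    private boolean conditionReady; // guarded by lock

    void awaitWithMonitor() throws InterruptedException {
        synchronized (monitor) {
            while (!monitorReady) { // loop protects against spurious wakeups
                monitor.wait();     // current thread joins the wait queue
            }
            monitorReady = false;
        }
    }

    void signalWithMonitor() {
        synchronized (monitor) {
            monitorReady = true;
            monitor.notify();       // asks for one waiter to be made runnable
        }
    }

    void awaitWithCondition() throws InterruptedException {
        lock.lock();
        try {
            while (!conditionReady) {
                ready.await();
            }
            conditionReady = false;
        } finally {
            lock.unlock();
        }
    }

    void signalWithCondition() {
        lock.lock();
        try {
            conditionReady = true;
            ready.signal();
        } finally {
            lock.unlock();
        }
    }
}
```

The cost discussed in this post is the time spent inside the notify (or signal) call on the signaling side.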

Just a quick remark about system calls: contrary to common belief, a system call does not imply a context switch[2]. It depends on the kernel implementation. On Linux, there is no context switch unless the system call implementation requires one, as for I/O. In the case of pthread_cond_signal, no context switch is involved.

Knowing that, what is the cost of calling notify for a producer thread?

Measure, don't guess!

Why not build a micro-benchmark? Because I do not care about average latency; I care about outliers and spikes: how the call behaves 50, 90, 95, 99 and 99.9% of the time, and what maximum I can observe.
Let's measure it with HdrHistogram from Gil Tene.

The test program creates n pairs of threads: in each pair, one thread (critical) tries to notify the second (flushing) that data is available to be processed (or flushed).
I run it with the parameters 16 1000, which means 16 pairs of threads doing wait/notify, each performing 1000 notifications.
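A minimal sketch of this kind of measurement is shown below. It is an assumption-laden stand-in, not the program that produced the results that follow: the class name NotifyLatency, the 100 microsecond pause between notifications, and the use of a sorted array with a nearest-rank percentile in place of HdrHistogram (to stay dependency-free) are all choices made here for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch: times only the notify() call made by each "critical" thread.
final class NotifyLatency {
    static final class Pair {
        final Object lock = new Object();
        boolean pending; // guarded by lock: data waiting to be flushed
        boolean done;    // guarded by lock: critical thread has finished
    }

    /** Returns the sorted notify() durations in nanoseconds. */
    static long[] measure(int pairs, int iterations) throws InterruptedException {
        long[] samples = new long[pairs * iterations];
        Thread[] threads = new Thread[pairs * 2];
        for (int p = 0; p < pairs; p++) {
            final Pair pair = new Pair();
            final int offset = p * iterations;
            threads[2 * p] = new Thread(() -> { // flushing thread
                synchronized (pair.lock) {
                    while (!pair.done) {
                        while (!pair.pending && !pair.done) {
                            try {
                                pair.lock.wait();
                            } catch (InterruptedException e) {
                                return;
                            }
                        }
                        pair.pending = false; // stands in for the real flush
                    }
                }
            });
            threads[2 * p + 1] = new Thread(() -> { // critical thread
                for (int i = 0; i < iterations; i++) {
                    synchronized (pair.lock) {
                        pair.pending = true;
                        long start = System.nanoTime();
                        pair.lock.notify();
                        samples[offset + i] = System.nanoTime() - start;
                    }
                    LockSupport.parkNanos(100_000); // assumed pacing between notifications
                }
                synchronized (pair.lock) {
                    pair.done = true;
                    pair.lock.notify();
                }
            });
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join(); // join makes the samples visible here
        Arrays.sort(samples);
        return samples;
    }

    /** Nearest-rank percentile over an already sorted array. */
    static long percentile(long[] sorted, double p) {
        int index = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, index)];
    }

    public static void main(String[] args) throws InterruptedException {
        long[] s = measure(Integer.parseInt(args[0]), Integer.parseInt(args[1]));
        System.out.println("count: " + s.length);
        System.out.println("min: " + s[0]);
        System.out.println("max: " + s[s.length - 1]);
        for (double p : new double[] {50, 90, 95, 99, 99.9}) {
            System.out.println(p + "%: " + percentile(s, p));
        }
    }
}
```

Run it as `java NotifyLatency 16 1000` to mirror the parameters above. The absolute numbers will differ from the results below, since the pacing and the recording method are not those of the original program.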

Results on Windows (ns):
count: 16000
min: 0
max: 55243
mean: 549.5238125
50%: 302
90%: 1208
95%: 1812
99%: 3019
99.9%: 11472

Results on Linux (ns):
count: 16000
min: 69
max: 20906
mean: 1535.5181875
50%: 1532
90%: 1790
95%: 1888
99%: 2056
99.9%: 3175

So most of the time we observe a couple of microseconds for a call to notify. But in some cases it can reach 50 microseconds! For low-latency systems this can be an issue and a source of outliers.

Now, if we push our test program a little and use 256 pairs of threads, we end up with the following results:

Results on Windows (ns):
count: 256000
min: 0
max: 1611088
mean: 442.25016015625
50%: 302
90%: 907
95%: 1208
99%: 1811
99.9%: 2717

Results on Linux (ns):
count: 256000
min: 68
max: 1590240
mean: 1883.61266015625
50%: 1645
90%: 2367
95%: 2714
99%: 7762
99.9%: 15230

A notify call can take 1.6ms!

Even though there is no contention in this code per se, another kind of contention happens in the kernel: the scheduler needs to arbitrate which thread can run. Having 256 threads trying to wake up their partner threads puts a lot of pressure on the scheduler, which becomes the bottleneck here.


Signaling can be a source of outliers not because we have contention on executing code between threads but because the OS scheduler needs to arbitrate among those threads, responding to wake up requests.


[1] Futexes Are Tricky, U. Drepper

Tuesday, July 7, 2015

WhiteBox API

I had already seen this API in JCStress, but it was a post from Rémi Forax on the mechanical-sympathy forum that brought it to my attention, when I saw what is possible to do with it. Here is a summary:

This API has been usable since JDK 8, and there are some new additions in JDK 9.

But how do you use it?
This API is not part of the standard API; it lives in the test library of OpenJDK. You can find it here.
Download the OpenJDK sources, then either build them entirely and grab wb.jar, or:

  1. go to test/testlibrary/whitebox directory
  2. javac -sourcepath . -d . sun\hotspot\**.java
  3. jar cf wb.jar .

Place your wb.jar next to your application and launch it with:

java -Xbootclasspath/a:wb.jar -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI ...

Here is an example you can run with the WhiteBox jar:

For this example, you need to add -XX:MaxTenuringThreshold=1 to make it work as expected.
Now you have an API to trigger a minor GC and test whether an object resides in the old generation. Pretty awesome!
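A minimal sketch of the minor GC and old generation check could look like the following. It assumes the JDK 8/9 era class name sun.hotspot.WhiteBox and its getWhiteBox, youngGC and isObjectInOldGen methods; check them against the sources you built wb.jar from. The class name WhiteBoxDemo is hypothetical, and the API is called reflectively only so that the sketch compiles and runs (reporting that WhiteBox is unavailable) when wb.jar or the VM flags are missing.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: triggers young GCs and asks where an object lives.
final class WhiteBoxDemo {
    static String run() {
        try {
            // Assumed class name, as found under test/testlibrary/whitebox.
            Class<?> wbClass = Class.forName("sun.hotspot.WhiteBox");
            Object wb = wbClass.getMethod("getWhiteBox").invoke(null);
            Method youngGC = wbClass.getMethod("youngGC");
            Method isObjectInOldGen = wbClass.getMethod("isObjectInOldGen", Object.class);

            Object o = new Object();
            Object before = isObjectInOldGen.invoke(wb, o);
            youngGC.invoke(wb); // first minor GC
            youngGC.invoke(wb); // second minor GC
            Object after = isObjectInOldGen.invoke(wb, o);
            return "in old gen before: " + before + ", after two young GCs: " + after;
        } catch (ReflectiveOperationException | LinkageError e) {
            // wb.jar not on the boot class path, or -XX:+WhiteBoxAPI not set
            return "WhiteBox unavailable: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With wb.jar on the boot class path you would normally import the class and call its methods directly; the reflective form only keeps the sketch independent of the jar at compile time.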

You can also trigger JIT compilation on demand for some methods and change VM flags on the fly:

Unlike Unsafe, this API seems difficult to use in a production environment, but at least you can have fun with it in the lab, or add it to your low-level tests as OpenJDK does. Enjoy!