Lbench - a simple Linux multithread benchmarking tool

Lbench was written to satisfy a personal desire to better understand some obscure performance issues. The performance increase from using multiple processor cores can be nearly 100% per core in some cases, and negative (an overall slowdown) in others. Lbench has helped me to better understand and characterize the bottlenecks.

Conclusions - based on my development system with 4 processor cores

  1. For pure calculations, 4 processors have 4 times the performance of one.
  2. The same conclusion also applies to cache memory performance.
  3. Main memory can be a significant bottleneck. For the 50 MB memory move benchmark, 4 processors have only 1.4x the throughput of 1 processor.
  4. Two threads contending for a mutex-controlled resource can swap ownership 7 million times per second per thread. With four threads, the aggregate rate drops to 39% of the two-thread aggregate.
  5. The overhead to create a thread which does nothing but exit is about 7 microseconds. This is independent of the number of processors.
  6. The overhead to create a sub-process (system() call) which does nothing but exit is about 1.2 milliseconds for one thread, increasing to about 1.7 milliseconds when 4 parallel threads are doing this continuously.
  7. Solid-state disk performance: read 120-140 MB/sec, write 20-40 MB/sec (erratic).
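
Conclusions 5 and 6 follow from inverting the per-thread rates in the benchmark results table further down this page. The following is only a small Python sketch of that arithmetic; the function and variable names are illustrative and are not part of lbench.

```python
def seconds_per_op(ops_per_second):
    """Invert a per-thread rate to get the average time for one operation."""
    return 1.0 / ops_per_second

# Per-thread rates copied from the benchmark results table.
thread_rate_1 = 148e3    # thread start/exit loop, 1 thread (148 KOP/s)
subproc_rate_1 = 871.0   # subprocess start/exit loop, 1 thread (OP/s)
subproc_rate_4 = 597.0   # subprocess start/exit loop, 4 threads (OP/s)

thread_us = seconds_per_op(thread_rate_1) * 1e6      # microseconds
subproc_ms_1 = seconds_per_op(subproc_rate_1) * 1e3  # milliseconds
subproc_ms_4 = seconds_per_op(subproc_rate_4) * 1e3  # milliseconds

print(f"thread create/exit:    {thread_us:.1f} microseconds")
print(f"subprocess, 1 thread:  {subproc_ms_1:.2f} ms")
print(f"subprocess, 4 threads: {subproc_ms_4:.2f} ms")
```

The results come out at roughly 6.8 microseconds, 1.15 ms and 1.68 ms, which match the rounded figures quoted in the list.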

Lbench makes the following measurements for 1 to 9 parallel threads:

  • integer-32 and floating-64 arithmetic performance
  • performance of a few engineering functions (sqrt, sin, etc.)
  • memory throughput for cache and main memory
  • time overhead to acquire and release a mutex lock
  • time overhead to start and complete a process thread
  • time overhead to start and complete a sub-process
  • time overhead to call and return from a function
  • disk throughput for serial and random I/O using various block sizes

Lbench has two other functions not related to benchmarking:

  • Cooling performance: run multiple CPU-bound threads, report core temperatures, detect if the CPU clock is being throttled due to thermal overload.
  • Memory burn-in: continuously fill memory with random values, read back and compare.
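
The memory burn-in idea (fill, read back, compare) can be illustrated with a short Python sketch. This is not lbench's implementation: the function names, buffer size and pass count are assumptions chosen for illustration, and a real burn-in would cover far more memory and run continuously.

```python
import random

def burn_in_pass(num_bytes, seed):
    """Fill a buffer with pseudo-random bytes, then read back and compare.

    Returns the number of mismatched bytes (0 if every byte reads back intact).
    """
    # Fill: write num_bytes pseudo-random bytes from a seeded generator.
    rng = random.Random(seed)
    buffer = bytearray(rng.getrandbits(8 * num_bytes).to_bytes(num_bytes, "little"))

    # Read back: regenerate the same sequence from the same seed
    # and count the positions where the stored byte differs.
    rng = random.Random(seed)
    expected = rng.getrandbits(8 * num_bytes).to_bytes(num_bytes, "little")
    return sum(1 for got, want in zip(buffer, expected) if got != want)

def burn_in(num_bytes=1 << 20, passes=3):
    """Run several passes with different seeds; return total mismatches."""
    total = 0
    for p in range(passes):
        total += burn_in_pass(num_bytes, seed=p)
    return total

if __name__ == "__main__":
    print("mismatched bytes:", burn_in())
```

Regenerating the expected values from the seed, rather than comparing against a saved copy of the buffer, means the check does not depend on a second stored copy that was written at the same time.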

The user guide is available from the [help] button and goes into more detail about each benchmark and its configurable parameters.

To download and install lbench, visit the download page.

Example Output
In this case, 4 parallel threads produced the 4 values seen in each report. The disk I/O rates are real, using a solid-state disk. All I/O is direct to disk and does not use memory caching.

The write performance of the solid-state disk may be erratic, but the overall performance is still superb. Here is the same benchmark run on a 10,000 rpm Raptor disk:


Benchmark Results
These are based on my current development system: Intel Core i7 920 (4 SMP processors running at 2.67 GHz) and Ubuntu 9.04 (64-bit). These numbers are PER THREAD measurements, so overall throughput is the per-thread value multiplied by the thread count.

| benchmark                  | units | 1-thread | 2-threads | 4-threads |
|----------------------------|-------|----------|-----------|-----------|
| integer add/subtract       | MOP/s | 1137     | 1135      | 1137      |
| integer multiply/divide    | MOP/s | 371      | 370       | 371       |
| double add/subtract        | MOP/s | 521      | 520       | 521       |
| double multiply/divide     | MOP/s | 226      | 226       | 226       |
| double square root         | MOP/s | 83       | 83        | 82        |
| pow (exponentiation)       | MOP/s | 30       | 30        | 30        |
| sin                        | MOP/s | 40       | 40        | 40        |
| asin                       | MOP/s | 34       | 34        | 34        |
| memory move loop in 5 KB   | GB/s  | 20.8     | 20.8      | 20.8      |
| memory move loop in 50 KB  | GB/s  | 12.2     | 12.2      | 12.2      |
| memory move loop in 500 KB | GB/s  | 8.9      | 8.0       | 4.1       |
| memory move loop in 5 MB   | GB/s  | 5.1      | 2.5       | 1.4       |
| memory move loop in 50 MB  | GB/s  | 4.0      | 2.5       | 1.4       |
| mutex lock/unlock loop     | MOP/s | 40       | 7.0       | 1.37      |
| thread start/exit loop     | KOP/s | 148      | 139       | 141       |
| subprocess start/exit loop | OP/s  | 871      | 730       | 597       |
| Fibonacci 44               | secs  | 12.4     | 12.2      | 12.1      |


Explanation of units

MOP/s = million operations per second per thread
KOP/s = thousand operations per second per thread
OP/s = operations per second per thread
GB/s = gigabytes per second per thread
secs = elapsed time in seconds per thread
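
Because every rate in the table is per thread, comparing thread counts means multiplying by the thread count first. The Python sketch below works through three rows of the table; the helper names are illustrative only. The Fibonacci row is an elapsed time rather than a rate, so it is left out.

```python
def aggregate(per_thread_value, threads):
    """Overall throughput = per-thread rate times the number of threads."""
    return per_thread_value * threads

def scaling(per_thread_1, per_thread_n, threads):
    """Aggregate throughput with n threads relative to a single thread."""
    return aggregate(per_thread_n, threads) / per_thread_1

# Per-thread values copied from the table, keyed by thread count.
int_add = {1: 1137, 2: 1135, 4: 1137}    # integer add/subtract, MOP/s
mem_50mb = {1: 4.0, 2: 2.5, 4: 1.4}      # memory move loop in 50 MB, GB/s
mutex = {1: 40.0, 2: 7.0, 4: 1.37}       # mutex lock/unlock loop, MOP/s

int_scaling_4 = scaling(int_add[1], int_add[4], 4)
mem_scaling_4 = scaling(mem_50mb[1], mem_50mb[4], 4)
mutex_4_vs_2 = aggregate(mutex[4], 4) / aggregate(mutex[2], 2)

print(f"integer add, 4 threads vs 1: {int_scaling_4:.1f}x")
print(f"50 MB move, 4 threads vs 1:  {mem_scaling_4:.1f}x")
print(f"mutex, 4 threads vs 2:       {mutex_4_vs_2:.0%}")
```

These reproduce the 4x, 1.4x and 39% figures given in the conclusions at the top of the page.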