Lbench: a simple Linux multithread benchmarking tool
Lbench measures various aspects of CPU and OS performance for 1 to 9 parallel threads. The objective is to show how well performance scales across multiple CPU cores and multiple threads. There are two modes of operation: command line and GUI window.
Benchmarks include the following:
+ CPU speed for integer and double arithmetic and some math functions.
+ Memory speed for small (cache) and large (main) memory regions.
+ Overhead times for thread creation and switching, and process creation.
+ Disk performance for random and serial operations with different block sizes.
+ The time to compute Fibonacci numbers recursively.
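As an illustration of the last item, here is a minimal sketch of a recursive Fibonacci timing loop. It is written in Python for brevity and is not Lbench's own code. The function names `fib` and `time_fib` are made up for this example, and the small argument is an assumption chosen so the run finishes quickly (the results table further down uses Fibonacci 44).

```python
import time


def fib(n):
    """Naive doubly recursive Fibonacci; run time grows exponentially with n."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def time_fib(n):
    """Return (fib(n), elapsed seconds) for a single call."""
    start = time.perf_counter()
    value = fib(n)
    return value, time.perf_counter() - start


if __name__ == "__main__":
    # n = 25 keeps the run short in an interpreter; this is an assumption,
    # not the argument Lbench uses.
    value, secs = time_fib(25)
    print(f"fib(25) = {value} in {secs:.3f} s")
```

The recursion does no memory traffic to speak of, so a benchmark of this kind mostly exercises function call overhead and integer arithmetic.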
Conclusions based on my development system with 4 x 2.67 GHz processor cores
- For pure calculations, 4 processors have 4 times the performance of one.
- The same conclusion also applies to cache memory performance.
- Main memory can be a significant bottleneck. For the 50 MB memory move benchmark, 4 processors have only 1.4 times the total throughput of 1 processor.
- Two threads contending for a mutex-controlled resource can swap ownership 7 million times per second. Four threads can only achieve 39% of this rate.
- The overhead to create a thread which does nothing but exit is about 7 microseconds, and this is independent of the number of processors or running threads.
- The overhead to create a sub-process (via system() or popen()) which does nothing but exit is about 1.2 milliseconds for one thread, increasing to about 1.7 milliseconds when 4 parallel threads are doing this continuously.
- Solid-state disk performance is fantastic.
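The overhead figures quoted above can be checked against the per-thread rates in the results table further down. The following is a small worked calculation, not part of Lbench. The helper name `seconds_per_op` is made up for this example, and the reading of the 39% mutex figure as a ratio of total rates (4 threads versus 2 threads) is an assumption; it is simply the arithmetic that reproduces the number from the table.

```python
def seconds_per_op(ops_per_second):
    """Convert a rate (operations per second) into time per operation."""
    return 1.0 / ops_per_second


# Thread start/exit: 148 KOP/s with 1 thread (from the results table).
thread_us = seconds_per_op(148e3) * 1e6      # microseconds, roughly 6.8

# Sub-process start/exit: 871 OP/s (1 thread) and 597 OP/s (4 threads).
proc_ms_1 = seconds_per_op(871) * 1e3        # milliseconds, roughly 1.15
proc_ms_4 = seconds_per_op(597) * 1e3        # milliseconds, roughly 1.68

# Mutex: 7.0 MOP/s per thread with 2 threads, 1.37 MOP/s with 4 threads.
# Assumption: the 39% figure compares total rates, not per-thread rates.
mutex_ratio = (4 * 1.37) / (2 * 7.0)         # roughly 0.39

if __name__ == "__main__":
    print(f"thread create/exit: {thread_us:.2f} us")
    print(f"sub-process, 1 thread: {proc_ms_1:.2f} ms")
    print(f"sub-process, 4 threads: {proc_ms_4:.2f} ms")
    print(f"mutex, 4 threads vs 2 threads (total): {mutex_ratio:.0%}")
```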
Lbench makes the following measurements for 1 to 9 parallel threads:
- integer-32 and floating-64 arithmetic performance
- performance of common engineering functions (sqrt, sin, etc.)
- memory throughput for cache (L1, L2) and main memory
- time overhead to acquire and release a mutex lock
- time overhead to start and complete a process thread
- time overhead to start and complete a sub-process
- time overhead to call and return from a function
- disk throughput for serial and random I/O using various block sizes
(disk I/O is direct to disk and does not use memory caching)
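To make the "start and complete a process thread" measurement concrete, here is a rough sketch of how such a loop could be timed with several workers running at once. It is written in Python and is not Lbench's implementation. The names `thread_start_exit_rate` and `run_parallel` and the iteration count are assumptions for illustration. Python's interpreter lock and thread wrapper add overhead of their own, so the rates it prints are not comparable with the Lbench numbers.

```python
import threading
import time


def _noop():
    """Thread body that does nothing but return."""


def thread_start_exit_rate(iterations=500):
    """Create, start and join do-nothing threads; return operations per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        t = threading.Thread(target=_noop)
        t.start()
        t.join()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def run_parallel(workers, iterations=500):
    """Run the timing loop in `workers` threads at once; one rate per worker."""
    rates = [0.0] * workers

    def worker(index):
        rates[index] = thread_start_exit_rate(iterations)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rates


if __name__ == "__main__":
    for n in (1, 2, 4):
        per_thread = run_parallel(n)
        print(n, "worker(s):", [f"{r:.0f} op/s" for r in per_thread])
```

Each worker reports its own rate, which mirrors the per-thread reporting convention used in the results table.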
Lbench has two other functions not related to benchmarking:
- Cooling performance: run multiple CPU-bound threads, report processor core temperatures, and detect whether the CPU clock is being throttled down due to thermal overload.
- Memory burn-in: continuously fill memory with random values, read back and compare.
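The memory burn-in idea (fill with random values, read back, compare) can be sketched in a few lines. This is a toy illustration in Python, not Lbench's code. The names `burn_in_pass` and `burn_in`, the buffer size and the pass count are all assumptions. A real burn-in would cover far more memory and run continuously.

```python
import os


def burn_in_pass(size_bytes):
    """Write random bytes into a buffer, read them back, and compare."""
    pattern = os.urandom(size_bytes)      # random reference values
    region = bytearray(size_bytes)        # memory region under test
    region[:] = pattern                   # fill the region
    return region == pattern              # True if the read-back matches


def burn_in(size_bytes, passes):
    """Run several passes; return the number of passes that mismatched."""
    return sum(0 if burn_in_pass(size_bytes) else 1 for _ in range(passes))


if __name__ == "__main__":
    # 16 MiB and 4 passes are arbitrary values chosen for a quick run.
    errors = burn_in(16 * 1024 * 1024, 4)
    print("mismatching passes:", errors)
```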
The user guide is available from the [help] button and goes into more detail about each benchmark and its configurable parameters.
Tarball with source code, make file, user guide: downloads
Example Output
[Screenshot: example reports. In this case, 4 parallel threads produced the 4 values seen in each report.]
[Screenshot: comparison between a fast rotating disk and a solid-state disk.]
Benchmark Results
These results are from my current development system: Intel Core i7 920 (4 SMP processors running at 2.67 GHz) with Ubuntu 9.04 (64-bit). All numbers are per-thread measurements, so total throughput is the per-thread value multiplied by the thread count.
| benchmark                   | units | 1-thread | 2-threads | 4-threads |
|-----------------------------|-------|----------|-----------|-----------|
| integer add/subtract        | MOP/s | 1137     | 1135      | 1137      |
| integer multiply/divide     | MOP/s | 371      | 370       | 371       |
| double add/subtract         | MOP/s | 521      | 520       | 521       |
| double multiply/divide      | MOP/s | 226      | 226       | 226       |
| double square root          | MOP/s | 83       | 83        | 82        |
| pow (exponentiation)        | MOP/s | 30       | 30        | 30        |
| trig sine function          | MOP/s | 40       | 40        | 40        |
| trig arc-sine function      | MOP/s | 34       | 34        | 34        |
| memory move loop in 5 KB    | GB/s  | 20.8     | 20.8      | 20.8      |
| memory move loop in 50 KB   | GB/s  | 12.2     | 12.2      | 12.2      |
| memory move loop in 500 KB  | GB/s  | 8.9      | 8.0       | 4.1       |
| memory move loop in 5 MB    | GB/s  | 5.1      | 2.5       | 1.4       |
| memory move loop in 50 MB   | GB/s  | 4.0      | 2.5       | 1.4       |
| mutex lock/unlock loop      | MOP/s | 40       | 7.0       | 1.37      |
| thread start/exit loop      | KOP/s | 148      | 139       | 141       |
| subprocess start/exit loop  | OP/s  | 871      | 730       | 597       |
| Fibonacci 44                | secs  | 12.4     | 12.2      | 12.1      |
Explanation of units
MOP/s = million operations per second per thread
KOP/s = thousand operations per second per thread
OP/s = operations per second per thread
GB/s = gigabytes per second per thread
secs = elapsed time in seconds per thread
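Because every rate in the table is per thread, total throughput and scaling have to be derived. The short calculation below does this for three of the memory move rows. It is a worked example, not part of Lbench. The names `ROWS`, `totals` and `scaling` are made up for illustration, and the figures are copied from the table above.

```python
THREADS = (1, 2, 4)

# Per-thread GB/s for 1, 2 and 4 threads, copied from the results table.
ROWS = {
    "memory move 5 KB": (20.8, 20.8, 20.8),
    "memory move 500 KB": (8.9, 8.0, 4.1),
    "memory move 50 MB": (4.0, 2.5, 1.4),
}


def totals(per_thread):
    """Total throughput = per-thread value times thread count."""
    return tuple(value * n for value, n in zip(per_thread, THREADS))


def scaling(per_thread):
    """Total throughput relative to the 1-thread total."""
    total = totals(per_thread)
    return tuple(t / total[0] for t in total)


if __name__ == "__main__":
    for name, per_thread in ROWS.items():
        factors = ", ".join(f"{s:.2f}x" for s in scaling(per_thread))
        print(f"{name}: {factors}")
```

For the 5 KB (cache) row the 4-thread factor comes out at 4 times the single-thread total, while for the 50 MB (main memory) row it is only 1.4 times, matching the conclusions stated earlier.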

