Lbench: a simple Linux multithread benchmarking tool
Lbench was written to satisfy a personal desire to better understand some obscure performance issues. The performance increase from using multiple processor cores can be nearly 100% per core in some cases, and negative (an overall slowdown) in others. Lbench has helped me to better understand and characterize the bottlenecks.
Conclusions based on my development system with 4 processor cores
- For pure calculations, 4 processors have 4 times the performance of one.
- The same conclusion also applies to cache memory performance.
- Main memory can be a significant bottleneck. For the 50 MB memory move benchmark, 4 processors have only 1.4x the throughput of 1 processor.
- Two threads contending for a mutex-controlled resource can each acquire it about 7 million times per second. With four threads, the overall rate drops to 39% of the two-thread rate.
- The overhead to create a thread which does nothing but exit is about 7 microseconds. This is independent of the number of processors.
- The overhead to create a sub-process (system() call) which does nothing but exit is about 1.2 milliseconds for one thread, increasing to about 1.7 milliseconds when 4 parallel threads are doing this continuously.
- Solid-state disk performance: read 120-140 MB/sec, write 20-40 MB/sec (erratic).
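The thread and sub-process overheads quoted above are the reciprocals of the per-thread rates listed under Benchmark Results. The short Python sketch below redoes that arithmetic. The dictionary and function names are illustrative and not part of lbench; the rates are copied from the results table.

```python
# Per-thread rates copied from the Benchmark Results table, in operations/second.
THREAD_RATE = {1: 148_000, 4: 141_000}   # thread start/exit loop (KOP/s -> OP/s)
SUBPROCESS_RATE = {1: 871, 4: 597}       # subprocess start/exit loop (OP/s)


def overhead_seconds(ops_per_second: float) -> float:
    """Time spent per operation, given a per-thread rate."""
    return 1.0 / ops_per_second


for threads in (1, 4):
    thread_us = overhead_seconds(THREAD_RATE[threads]) * 1e6
    subprocess_ms = overhead_seconds(SUBPROCESS_RATE[threads]) * 1e3
    print(f"{threads} thread(s): thread {thread_us:.1f} us, "
          f"sub-process {subprocess_ms:.2f} ms")
```

This gives roughly 6.8 and 7.1 microseconds for thread creation, and roughly 1.15 and 1.68 milliseconds for a sub-process, in line with the rounded figures in the list.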
Lbench makes the following measurements for 1 to 9 parallel threads:
- integer-32 and floating-64 arithmetic performance
- performance of a few engineering functions (sqrt, sin, etc.)
- memory throughput for cache and main memory
- time overhead to acquire and release a mutex lock
- time overhead to start and complete a process thread
- time overhead to start and complete a sub-process
- time overhead to call and return from a function
- disk throughput for serial and random I/O using various block sizes
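The general shape of such a measurement (run the same operation in N parallel threads and report a rate for each thread) can be sketched in Python as follows. This is only an illustration of the structure, not lbench's code: the function names are hypothetical, and because CPython's global interpreter lock serializes most work, the numbers it prints will not resemble the native results reported further down.

```python
import threading
import time


def per_thread_rates(operation, threads: int, iterations: int = 2000) -> list:
    """Run `operation` `iterations` times in each of `threads` parallel
    threads and return each thread's rate in operations per second."""
    rates = [0.0] * threads
    barrier = threading.Barrier(threads)  # release all timed loops together

    def worker(index: int) -> None:
        barrier.wait()
        start = time.perf_counter()
        for _ in range(iterations):
            operation()
        rates[index] = iterations / (time.perf_counter() - start)

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return rates


lock = threading.Lock()


def lock_and_unlock() -> None:
    """Measured operation: acquire and release a shared mutex."""
    with lock:
        pass


def start_and_join_thread() -> None:
    """Measured operation: start a thread that does nothing and wait for it."""
    t = threading.Thread(target=lambda: None)
    t.start()
    t.join()


for n in (1, 2, 4):
    print(n, "thread(s), mutex OP/s:",
          [round(r) for r in per_thread_rates(lock_and_unlock, n)])
```

Passing `start_and_join_thread` instead of `lock_and_unlock` times thread creation with the same harness.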
Lbench has two other functions not related to benchmarking:
- Cooling performance: run multiple CPU-bound threads, report core temperatures, detect if the CPU clock is being throttled due to thermal overload.
- Memory burn-in: continuously fill memory with random values, read back and compare.
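The memory burn-in idea (fill, read back, compare) can be illustrated with a small Python sketch. It makes several assumptions that are not taken from lbench: a seeded pseudo-random generator is replayed to check the data, the buffer is small, and a fixed number of passes is run rather than a continuous loop. The function names are hypothetical.

```python
import random


def burn_in_pass(size_bytes: int, seed: int) -> int:
    """Fill a buffer with pseudo-random bytes, then read it back and
    return the number of bytes that differ from what was written."""
    rng = random.Random(seed)
    buffer = bytearray(rng.getrandbits(8) for _ in range(size_bytes))

    # Replay the same pseudo-random sequence and compare byte by byte.
    rng = random.Random(seed)
    return sum(1 for byte in buffer if byte != rng.getrandbits(8))


def burn_in(size_bytes: int, passes: int) -> int:
    """Run several passes, each with a different seed; return total mismatches."""
    return sum(burn_in_pass(size_bytes, seed) for seed in range(passes))


print("mismatched bytes:", burn_in(size_bytes=1 << 16, passes=4))
```

On healthy memory the mismatch count is expected to be zero. A native tool would exercise far more memory, far faster, than this sketch can.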
The user guide is available from the [help] button and goes into more detail about each benchmark and its configurable parameters.
To download and install lbench, visit the download page.
Example Output
In this case, 4 parallel threads produced the 4 values seen in each report. The disk I/O rates are real, using a solid-state disk. All I/O is direct to disk and does not use memory caching.
The write performance of the solid-state disk may be erratic, but the overall performance is still superb. Here is the same benchmark run on a 10,000 rpm Raptor disk:

Benchmark Results
These are based on my current development system: Intel Core i7 920 (4 SMP processors running at 2.67 GHz) and Ubuntu 9.04 (64-bit). These numbers are PER THREAD measurements, so overall throughput requires multiplication by the thread count.
| benchmark | units | 1-thread | 2-threads | 4-threads |
|---|---|---|---|---|
| integer add/subtract | MOP/s | 1137 | 1135 | 1137 |
| integer multiply/divide | MOP/s | 371 | 370 | 371 |
| double add/subtract | MOP/s | 521 | 520 | 521 |
| double multiply/divide | MOP/s | 226 | 226 | 226 |
| double square root | MOP/s | 83 | 83 | 82 |
| pow (exponentiation) | MOP/s | 30 | 30 | 30 |
| sin | MOP/s | 40 | 40 | 40 |
| asin | MOP/s | 34 | 34 | 34 |
| memory move loop in 5 KB | GB/s | 20.8 | 20.8 | 20.8 |
| memory move loop in 50 KB | GB/s | 12.2 | 12.2 | 12.2 |
| memory move loop in 500 KB | GB/s | 8.9 | 8.0 | 4.1 |
| memory move loop in 5 MB | GB/s | 5.1 | 2.5 | 1.4 |
| memory move loop in 50 MB | GB/s | 4.0 | 2.5 | 1.4 |
| mutex lock/unlock loop | MOP/s | 40 | 7.0 | 1.37 |
| thread start/exit loop | KOP/s | 148 | 139 | 141 |
| subprocess start/exit loop | OP/s | 871 | 730 | 597 |
| Fibonacci 44 | secs | 12.4 | 12.2 | 12.1 |
Explanation of units
MOP/s = million operations per second per thread
KOP/s = thousand operations per second per thread
OP/s = operations per second per thread
GB/s = gigabytes per second per thread
secs = elapsed time in seconds per thread
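Because every rate is per thread, comparing thread counts means multiplying by the thread count first. The Python sketch below applies that to a few rows of the table. The helper names are illustrative only and the values are copied from the table.

```python
def aggregate(per_thread_rate: float, threads: int) -> float:
    """Overall throughput: the per-thread rate times the number of threads."""
    return per_thread_rate * threads


def scaling(rate_1: float, rate_n: float, threads: int) -> float:
    """Overall throughput with `threads` threads relative to a single thread."""
    return aggregate(rate_n, threads) / aggregate(rate_1, 1)


# Per-thread values copied from the table above.
print(f"integer add/subtract, 4 threads: {scaling(1137, 1137, 4):.1f}x")  # 4.0x
print(f"memory move 50 MB, 4 threads: {scaling(4.0, 1.4, 4):.1f}x")       # 1.4x

# Mutex: overall 4-thread rate compared with the overall 2-thread rate.
print(f"mutex, 4 vs 2 threads: {aggregate(1.37, 4) / aggregate(7.0, 2):.0%}")  # 39%
```

The three results match the pure-calculation, main-memory, and mutex conclusions listed at the top of the page.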
