Microbenchmarks, as a general rule, are a waste of time. Let’s just get that out of the way up front. They are also, as a general rule, totally inaccurate, measuring the execution time of some snippet of code in a context that is completely divorced from the reality in which that code will actually be used. So if after reading this article you think, “I should tell Eric what a waste of time this was!” — don’t bother. I already know.
But… microbenchmarks are also fun, and sometimes interesting, and often vastly easier to implement than a real benchmark of the same code in a production system. So a couple weeks ago, when my colleague proposed using an MD5 hash with value zero as a sentinal indicating that the checksum had not yet been calculated, I wondered: what is the fastest way to test if an MD5 hash has any non-zero bits? I had some time to kill so I wrote a microbenchmark comparing several implementations. The results are presented here for your amusement and edification.
The goal of the benchmark is to determine which of several methods can most quickly determine whether an MD5 hash is all zeroes. An MD5 hash is 128 bits long, so in essence this problem boils down to simply checking for non-zero bits in an arbitrary sequence of 16 bytes. You can find the benchmark source code in my Github repo.
For sample data I simply allocated about 100,000 17 byte arrays, then set one byte in each to a non-zero value. This structure made it wasy to easily test the effect of memory alignment on performance, by using either the first 16 or the last 16 bytes of each 17 byte array as the value under test. The total size was significant as well: smaller than the L1 cache on a typical modern CPU, so we avoid measuring memory bandwidth performance.
I tested the following methods for determining whether a hash is all zero, in both aligned and unaligned varieties:
- Naive loop over bytes: the most obvious approach simply loops over the bytes, testing each in turn.
- Unrolled loop over bytes: loop unrolling is a common optimization for loops with a fixed number of iterations.
- Bitwise OR of bytes: OR all bytes together, then compare the result to zero.
- Slice by four: treat the 16 bytes as an array of four 32-bit integers, testing each for equality with zero.
- Slice by eight: treat the 16 bytes as an array of two 64-bit integers, testing each for equality with zero.
- Find first set: use the GCC builtin function __builtin_ffs, which finds the first non-zero bit in a 32-bit integer.
The code was compiled to 64-bit binaries using GCC 4.7.2 with -O2 optimization and no debugging symbols.
I ran the benchmarks on two systems. First I tried a quad-core hyperthreaded 1.73GHz Intel Core i7 laptop with 16GB RAM and an SSD hard drive. All cores were put into performance mode to ensure no CPU frequency scaling was enabled, using (for example) echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor. Here are the results (longer is better):
The results are a bit erratic, to be sure — for example, it makes no sense that the unaligned version of the unrolled naive loop should be faster than the aligned version. This is likely just because the operation being measured is so fast that it’s hard to get a “pure” measurement: even tiny fluctuations in system load perturb the tests. You’ll see that if you run the benchmark a few times, you’ll get slightly different results each time. That just means that we shouldn’t make any hard-and-fast decisions based on the exact numbers here.
Nevertheless, the difference between slice-by-four or slice-by-eight and the other strategies is substantial enough that I trust the overall result, if not the exact numbers. That is, slice-by-four and slice-by-eight are clearly significantly faster than any other approach. But — and here’s where we discover the degree of yak shaving we’ve been up to — even the slowest strategy is still pretty damn fast. In all honesty, it is not going to make a lick of difference in overall application performance, unless you really do need to do billions of these checks. A realistic upper bound for my application is maybe ten million, which would consume a tiny fraction of a second using even the naive loop.
One final surprise in this data is the difference in performance between aligned and unaligned memory access — or rather, the lack therof. Conventional wisdom is that you pay a performance penalty for accessing unaligned memory, at least when you try to treat it as 32- or 64-bit blocks. In fact, this result supports other tests which indicate that on the Intel Core i7 there is effectively no penalty for unaligned memory access.
If you’re only working on x86 architectures, you may consider this exercise concluded. But we actually run our software on SPARC architecture as well, so before committing to an implementation let’s take a look at how the benchmark behaves there. This time I used a 1GHz SPARCv9 CPU:
Slice-by-four and slice-by-eight are fastest here too, as long as the data is aligned. If not — BOOM! The application crashes with a bus error, because the SPARC architecture is actually quite sensitive to data alignment. If you want to treat a piece of memory as an integer, it had better be properly aligned.
Informed by these results, we opted to use the slice-by-four strategy. That required a modification of our code, which previously did not guarantee alignment of MD5 hashes. Fortunately that modification was trivial, so it cost us little time and did not make the code any less clear. But you can see hints of the real danger of microbenchmarks: it’s often difficult for a developer to ignore the existence of a faster-but-more-complex strategy, despite evidence that the simple implementation is more than adequately performant. In this case the cost of enabling the faster implementation was negligable, but I’ve seen developers (including myself) needlessly contort code in the name of performance, doggedly defending their choices with microbenchmarks like these. Don’t let yourself become another statistic: use microbenchmarks, sure, but always evaluate the results in the larger context of overall application performance.
With that, I invite you to sound off in the comments: what did I overlook in my microbenchmark? How were the tests flawed? What other strategies do you know for testing whether a series of 16 bytes contains any non-zero bits?