Unit testing is one of the cornerstones of modern software development, but there’s a surprising lack of advice about how to write good unit tests. That’s a shame, because bad unit tests are worse than no unit tests at all. Over the past decade at Electric Cloud, I’ve written thousands of tests — a full build across all platforms runs over 100,000 tests. In this post I’ll share some tips for writing robust, maintainable tests. I learned these the hard way, but hopefully you can learn from my mistakes.
A key focus here is on eliminating so-called “flaky tests”: those that work almost all the time, but fail once-in-a-great-while for reasons unrelated to the code under test. Such unreliable tests erode confidence in the test suite and even in the value of unit testing itself. In the worst case, a history of failures due to flaky tests can cause people to ignore sporadic-but-genuine test failures, allowing rarely seen but legitimate bugs into the wild.
If that’s not enough to convince you to eliminate flaky tests, consider this: suppose each of your flaky tests fails just one time out of every 10,000 runs, and that you have a thousand such tests overall. At that rate, about 10% of your CI builds will fail due to flaky tests.
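The arithmetic behind that figure can be checked in a few lines of Python. This is a quick sketch that assumes each flaky test fails independently of the others, which the text does not state explicitly:

```python
# Probability that a CI build hits at least one flaky failure,
# assuming each flaky test fails independently of the others.
flaky_tests = 1000           # number of flaky tests in the suite
failure_rate = 1 / 10_000    # chance that any one of them fails in a given run

p_all_pass = (1 - failure_rate) ** flaky_tests
p_build_fails = 1 - p_all_pass

print(f"{p_build_fails:.1%}")  # roughly 9.5%, i.e. about one build in ten
```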
Now that I’ve got your attention, let’s see how to write better tests. Got some tips of your own? Add them in the comments!
1. Never use hardcoded network port numbers.
The software I write involves network communication, which means that many of the tests create network sockets. In pseudo-code, a test for the server component looks something like this:
- Clean up the previous server instance (if any).
- Instantiate the server on port 6003.
- Connect to the server via port 6003.
- Send some message to the server via the socket.
- Assert that the response received is correct.
At first glance this seems pretty safe: no standard service uses port number 6003, so there won’t be any contention with third parties for that port, and by front-loading the cleanup of the server we ensure that there won’t be any contention with our other tests for the port. And, of course, it (seems to) work! In fact, it probably works almost every time you run it. But rest assured: one day, seemingly at random, this test will fail.
For us the failures started after literally years and tens of thousands of executions. We never saw the same test fail twice, but the failure mode was always the same: “Only one usage of each socket address (protocol/network address/port) is normally permitted.” I wish I could say with certainty why the failures started happening — I suspect that our anti-virus software transiently grabs a dynamic port, and sometimes it just happens to grab the port that we planned to use.
Fortunately the solution is simple: instead of using a hardcoded port number, use a dynamic port assigned by the operating system. Usually that means binding to port number zero. This is slightly less convenient than a hardcoded port number, because you’ll have to query the socket after it’s bound to determine the actual port number, but that’s a small price to pay to be confident you’ll never have a spurious test failure due to this mistake.
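As a rough illustration in Python, binding to port zero and then querying the socket might look like the sketch below. The helper name `start_listener` is made up for this example, and the server here only listens; a real test would hand the socket to the code under test.

```python
import socket

def start_listener():
    """Bind a TCP listener to an OS-assigned port; return (socket, port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0 asks the OS to pick a free port
    server.listen(1)
    port = server.getsockname()[1]  # query the port that was actually assigned
    return server, port

# Usage: the test connects to whatever port the OS handed out.
server, port = start_listener()
try:
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    client.close()
finally:
    server.close()
```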
2. Never use “sleep()” to synchronize test threads.
A common mistake when writing tests that involve multiple threads is to attempt to use sleep() to synchronize events in different threads. For example, you may have a server thread which opens a socket and waits for a connection. You want to test the behavior of the server thread, but you have to be sure that the socket is opened and ready to accept connections before you try to connect to it. A quick-and-dirty approach might look like this:
- Start server thread.
- Sleep a bit to give the thread time to get started.
- Open a connection to the server port.
There are some problems with this strategy. If you set the delay too short, occasionally the test will fail because the server socket won’t be ready — when you try to open the client-side connection you’ll get a connection refused error. Conversely, you can make the test fairly reliable by setting the delay very long — multiple seconds — but then the test will take at least that long to run, every time you run it.
Instead you should use a condition variable to synchronize the two threads. If that is not practical or not possible, a retry loop is a decent alternative. In that case, make sure that your loop doesn’t retry so frantically that it starves the other thread of CPU cycles to make progress. I like to use an exponential backoff, so initially I get a few retries at very short intervals, then the intervals become progressively longer to give the other thread more time to work:
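A minimal sketch of such a retry loop in Python follows. The function name `connect_with_backoff`, the starting delay, the cap, and the overall deadline are all assumptions chosen for illustration, not values from the original test suite.

```python
import socket
import time

def connect_with_backoff(host, port, initial_delay=0.001, max_delay=0.5, deadline=10.0):
    """Retry a TCP connect, doubling the pause after each refused attempt."""
    delay = initial_delay
    give_up_at = time.monotonic() + deadline
    while True:
        try:
            return socket.create_connection((host, port), timeout=1.0)
        except ConnectionRefusedError:
            if time.monotonic() + delay > give_up_at:
                raise  # the server never came up; let the test fail loudly
            time.sleep(delay)                  # yield the CPU to the server thread
            delay = min(delay * 2, max_delay)  # 1ms, 2ms, 4ms, ... capped at max_delay
```

When the test itself owns the server thread, a `threading.Event` that the server sets once it is listening plays the role of the condition variable, and no retry loop is needed.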
3. Never rely on timing data to verify behavior.
Another mistake is using the observed duration of an operation to assert that a particular behavior has been implemented. For example, to test that a socket read operation times out if no data is received within a set period of time, you might write test code like this:
- Set socket timeout to 200ms
- Mark start time
- Attempt to read from socket
- Mark end time
- Assert that end minus start is between 200ms and 250ms
The problem is that your test runner might get put to sleep between steps 2 and 3, and again between steps 3 and 4; there’s just no way to control what else might be happening on the computer executing the tests. That means that the delta between the times could be significantly higher than the 200ms you expect. Depending on your hardware and timers, it could even be less than 200ms, despite your code being implemented correctly!
There are at least two robust ways to implement this test. The first is to add a counter to the code under test, to be incremented whenever the read operation times out. In the test you could check the value before and after the attempted read, and reasonably expect the value to have increased by exactly one. The second approach is to forgo the explicit validation altogether: if the test completes, then the timeout must be operating correctly. If the test hangs, then the timeout must not be operating.
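The counter approach might be sketched in Python as follows. The `TimedReader` class and its `timeout_count` attribute are hypothetical stand-ins for the real code under test; the point is only that the test asserts on the counter, never on elapsed time.

```python
import socket

class TimedReader:
    """Hypothetical unit under test: reads from a socket with a timeout."""

    def __init__(self, sock, timeout_seconds):
        self.sock = sock
        self.sock.settimeout(timeout_seconds)
        self.timeout_count = 0  # incremented each time a read times out

    def read(self, nbytes=1024):
        try:
            return self.sock.recv(nbytes)
        except socket.timeout:
            self.timeout_count += 1
            return None

def test_read_times_out_when_no_data_arrives():
    ours, theirs = socket.socketpair()  # connected pair; nothing is ever sent
    try:
        reader = TimedReader(ours, timeout_seconds=0.2)
        before = reader.timeout_count
        assert reader.read() is None
        assert reader.timeout_count == before + 1  # no clock arithmetic involved
    finally:
        ours.close()
        theirs.close()
```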
4. Never rely on implicit file timestamps.
If your software is sensitive to file timestamps, such as for determining whether file X is newer than file Y, then avoid trusting implicit timestamps in your tests. For example, you might write test code like this:
- Create “old” file Y.
- Create “new” file X.
- Verify that the unit under test behaves correctly given that X is newer than Y.
Superficially this looks sound: X is created after Y, so it should have a timestamp later than Y. Unless, of course, it doesn’t, which can happen for a variety of odd reasons. For example, the system clock might get adjusted underneath your feet, due to NTP or another time-synchronization system. I’ve even seen cases where the file creations happen so quickly that the two files effectively have identical timestamps!
The fix is simple: explicitly set the timestamps on the files to ensure that the relationship is as expected. Generally you should specify a difference of at least 2 full seconds, to accommodate filesystems that have very low-resolution timestamps (FAT, for example, records modification times in 2-second increments). You should also avoid relying on subsecond timestamp resolution, again to accommodate filesystems that don’t support very high-resolution timestamps.
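In Python, one way to set the timestamps explicitly is `os.utime`. The sketch below assumes a hypothetical `is_newer` function as the unit under test; the file names and the base timestamp are arbitrary.

```python
import os
import tempfile

def is_newer(path_x, path_y):
    """Hypothetical unit under test: True if X was modified after Y."""
    return os.stat(path_x).st_mtime > os.stat(path_y).st_mtime

def test_is_newer_with_explicit_timestamps():
    with tempfile.TemporaryDirectory() as tmp:
        old_y = os.path.join(tmp, "y.txt")
        new_x = os.path.join(tmp, "x.txt")
        for path in (old_y, new_x):
            with open(path, "w") as f:
                f.write("data")

        # Whole-second values, two seconds apart, so that coarse-grained
        # filesystems still preserve the ordering.
        base = 1_000_000_000
        os.utime(old_y, (base, base))          # (access time, modification time)
        os.utime(new_x, (base + 2, base + 2))

        assert is_newer(new_x, old_y)
        assert not is_newer(old_y, new_x)
```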
5. Include diagnostic information in test assertions.
You can save yourself a lot of trouble by arranging your test assertions so that they provide detailed information about any failures, rather than simply telling you whether or not the assertion passed. For example, here’s an unhelpful assertion:
If the errors variable contains error text, the assertion will fail — but you’ll have no idea what the errors were, and thus you’ll have no idea what went wrong in the test. In contrast, here’s an informative assertion:
Now if the assertion fails the test harness will show you the value of errors, so you’ll have some useful information to start your debugging.
6. Include an explanation of the test in the comments.
Finally, don’t forget to put comments in your test code! Explain how the test works, and why you believe the test actually exercises the feature that you think it tests. It may seem a bit tedious, but remember that the rest of your team may not be as familiar with the code as you are, and they may not know what steps are needed to elicit a particular response from the code. For that matter, in a few months or years you may not remember how to test what you’re trying to test. Such comments are invaluable when updating tests after a refactoring, to understand how the test should be adjusted, as well as when debugging — to understand why a test failed to expose a defect. Here’s an example:
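As one possible illustration in Python, the comments in the sketch below explain what the test verifies, how it provokes the behavior, and why the setup is arranged the way it is. The `needs_rebuild` function and the file names are hypothetical.

```python
import os
import tempfile

def needs_rebuild(source, target):
    """Hypothetical unit under test: True if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.stat(source).st_mtime > os.stat(target).st_mtime

def test_rebuild_when_source_is_newer():
    # What this verifies: needs_rebuild() reports True when the source file
    # has been modified more recently than the target built from it.
    #
    # How it works: both files are created, then their modification times
    # are set explicitly, with the source two seconds later than the target.
    # The target exists, so the "missing target" shortcut cannot be what
    # returns True; only the timestamp comparison can.
    #
    # Why explicit timestamps: creation order alone does not guarantee
    # distinct, correctly ordered timestamps on every filesystem.
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "input.c")
        target = os.path.join(tmp, "input.o")
        for path in (source, target):
            with open(path, "w") as f:
                f.write("placeholder")
        os.utime(target, (1_000_000_000, 1_000_000_000))
        os.utime(source, (1_000_000_002, 1_000_000_002))
        assert needs_rebuild(source, target)
```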
Everybody knows it’s important to write unit tests. Following the suggestions here will help make sure that your tests are reliable and maintainable. If you have tips of your own, add them in the comments!