The ElectricAccelerator 7.0 “Ship It!” Award

With ElectricAccelerator 7.0 out the door, it’s finally time for the moment you’ve all been waiting for: the unveiling of the Accelerator 7.0 “Ship It!” award. This time I picked the Clockwork Android, in light of our emphasis on Android build performance. Here’s the trading card that accompanied the figure:

BEEP BOP BOOP

metrics metrics metrics metrics

As with the 6.2 award, I included some metrics about the release:

  • Number of days in development. This release was relatively long compared to our other releases — not quite our longest development cycle, but close. That’s partly because this release encompassed the Thanksgiving and Christmas seasons, which typically costs us 3-4 weeks of development and testing time. We also deliberately pushed out the release date about 2 weeks to incorporate feedback from beta testers.
  • JIRA issues closed. We resolved 185 issues in this release. That’s double what we had in 6.2, and it includes some really cool new features.
  • Performance improvement. Since this release was all about performance, it made sense to include the data that proves our success. I had some trouble finding a good way to visualize the improvement, but I’m happy with the finished product.

Of course, none of the achievements in Accelerator 7.0 would have been possible without the hard work and dedication of the incredibly talented Accelerator team. Thank you all!

“Playing” with agile

Recently we invited a Scrum coach to Electric Cloud to teach us how to get started with the Scrum model of agile development. On the first day we played a game intended to introduce us to the core elements of Scrum: plan, do, inspect, adapt (or “plan, do, check, act”; or “the Deming cycle”). Without getting into a deeper discussion of Scrum itself, I thought I would share my team’s performance in this fun little game. If you’re familiar with ElectricAccelerator, our game strategy will come as no surprise: it exploits parallel processing and horizontal scalability to improve performance.

The game was simple: we were given a bucket of rubber bouncy balls and instructed to pass balls from person to person, until every member of the team had touched the ball. For each ball that completed the circuit we earned one point; for each drop we were penalized three points. A few rules made the game more interesting. First, it was forbidden for two people to touch a ball at the same time — there had to be “air time” between individuals. Second, we could not pass balls to the person directly to our left or to our right. Finally, there was a time limit (just like a sprint): we had only 2 minutes to pass balls in each round.

At the start of the game, we were given 5 minutes to plan our strategy and make a prediction of how many balls we would pass. Between each round we had 3 minutes more to modify our strategy based on our experience in the previous round and make a new prediction for the next round. If you are familiar with Scrum you’ll recognize the analogy to story points.

In total we had 12 players plus one scribe (me) who was tasked with counting the number of balls passed and dropped.

Round 1 (plan: 0; actual: 29)

Our first planning phase was best described as chaotic. It wasn’t even clear who was on our team, because some people joined the activity late. We weren’t sure about the constraints. Everybody had ideas about how best to pass the balls, so everybody was talking at once. It seemed simple, but in fact we had barely gotten everybody in place when the 5 minute prep time elapsed. We did manage to agree on the three key elements of our strategy though:

  • Dropping balls into the cupped hands of the receiver, rather than throwing them, to minimize the risk of dropping balls.
  • Two rings of players, one inner and one outer, facing each other. Balls would be passed in a zig-zag between the rings.
  • Parallel passing. Everybody would be either passing or receiving at all times.

This diagram shows the positions of the players, as well as which players had a ball at the start of the round:

Scrum game, round 1

As you can see, we had too many balls “in play” when we started, given our strategy — a direct consequence of unclear communication during the planning phase. The surplus balls were dropped almost immediately. Our final score for this round was 29: 1 point for each of 35 balls passed, minus 6 points for 2 balls dropped.

Round 2 (plan: 50; actual: 72)

Round 1 demonstrated that our core strategy was sound, but to improve performance we decided to make a couple of tweaks. First, we made certain that we were in agreement about which players would start with balls: only those in the outside ring. Second, we realized we could improve throughput by passing two balls at a time, instead of just one. With our drop-into-cupped-hands strategy this was hardly more risky than one ball at a time. We predicted that we would pass 50 balls, about 60% more than we did in round 1. Here’s the updated diagram showing the starting positions of the players and balls for round 2:

Scrum game, round 2

Our score in round 2 was 72: 72 balls passed, with zero dropped.

Round 3 (plan: 120; actual: 57)

At this point we believed we had everything worked out. We increased the balls-per-pass to three and predicted that this would result in about 120 balls passed. But during the planning phase one of our players abruptly left — to be honest I’m not even sure who it was or why they stepped away. All I know is that suddenly we had only 11 players instead of 12, which left us with 6 on the outer ring but only 5 on the inner ring. We didn’t realize the problem until people started lining up in position near the end of the planning period. With the clock ticking we made an exceptionally poor decision about how to handle the mismatch: one of the inner ring players would serve as receiver for two of the outer ring players. First they would receive from player A, then pass to player B; then immediately receive the balls back from player B before sending them on to player C. Sounds complicated, right? It was. Here’s the updated diagram:

Scrum game, round 3

This proved disastrous for our performance. At speed, it was (understandably) hard for the player serving double-duty to efficiently execute the elaborate sequence of exchanges. In addition, we were careless when we grabbed the extra balls we needed: although most were consistently round, a few were those oddly shaped rubber rocks which move in unpredictable ways. These misshapen lumps of rubber are just a bit harder to catch than regular balls, and that slowed us down. Our final score in this round was just 57: 60 balls passed, one dropped.

Round 4 (plan: 120; actual: 123)

The obvious problem in round three was the mismatch in the sizes of the inner and outer rings. The solution was obvious too: remove one player from the outer ring to restore equilibrium. There was just one problem. According to the rules of the game, a ball had to be touched by every player in order to count as having been passed. What could we do? We pled our case to the coach, who agreed to let us have one person sit out this round — a demonstration of another fact of agile development: sometimes a team can be made more productive by having fewer people on it. With 5 players on each ring, we again predicted that we would pass 120 balls. Here’s how the layout looked for this round:

Scrum game, round 4

This was our best round yet with a final score of 123: 135 balls passed, with only four drops.

Review

Overall I was really pleased with our performance in this game — granted, the point of the exercise was not actually to see how many balls we could pass around, but to experience the plan-do-inspect-adapt cycle directly. And we certainly did that too. But come on! How can you not be excited by a more than 4x improvement in throughput from round 1 to round 4? I’m not surprised though. After all, speed is the name of the game for ElectricAccelerator. This is what we do. That we got there by applying the same strategies to this game that we use in Accelerator itself — icing on the cake.

Later that night I realized an error in our execution on round 4 though. We chose to even out the rings by dropping one player from the outer ring, when we could just as easily have added a player to the inner ring: me. As scribe, I did not actively participate in the ball passing, only the planning and review. But there was no particular reason I couldn’t have stepped in. That would have increased our throughput by 20% (by increasing the number of balls in play from 15 to 18). I think we could have exceeded 150 balls passed with that configuration. So in the end, the game was a great demonstration of what is probably the most important concept from Scrum: there’s always room for improvement.
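
For anyone who wants to check the arithmetic, here is a small Python sketch of the scoring rule (one point per ball passed, minus three per drop) applied to the four rounds. The function name is my own invention, and the final projection assumes throughput scales linearly with the number of balls in play, which we never actually tested:

```python
def score(passed, dropped, penalty=3):
    """One point per completed ball, minus `penalty` points per drop."""
    return passed - penalty * dropped

# (balls passed, balls dropped) for each round, as reported in the text.
rounds = {1: (35, 2), 2: (72, 0), 3: (60, 1), 4: (135, 4)}
scores = {n: score(p, d) for n, (p, d) in rounds.items()}
print(scores)  # {1: 29, 2: 72, 3: 57, 4: 123}

# Improvement in score from round 1 to round 4.
print(round(scores[4] / scores[1], 2))  # 4.24

# Hypothetical round 5: 18 balls in play instead of 15 (assumes linear scaling).
projected = 135 * 18 // 15
print(projected)  # 162
```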

What is the fastest way to find non-zero bits in an MD5 hash?

Microbenchmarks, as a general rule, are a waste of time. Let’s just get that out of the way up front. They are also, as a general rule, totally inaccurate, measuring the execution time of some snippet of code in a context that is completely divorced from the reality in which that code will actually be used. So if after reading this article you think, “I should tell Eric what a waste of time this was!” — don’t bother. I already know.

But… microbenchmarks are also fun, and sometimes interesting, and often vastly easier to implement than a real benchmark of the same code in a production system. So a couple of weeks ago, when my colleague proposed using an MD5 hash with value zero as a sentinel indicating that the checksum had not yet been calculated, I wondered: what is the fastest way to test if an MD5 hash has any non-zero bits? I had some time to kill so I wrote a microbenchmark comparing several implementations. The results are presented here for your amusement and edification.

The benchmark

The goal of the benchmark is to determine which of several methods can most quickly determine whether an MD5 hash is all zeroes. An MD5 hash is 128 bits long, so in essence this problem boils down to simply checking for non-zero bits in an arbitrary sequence of 16 bytes. You can find the benchmark source code in my GitHub repo.

For sample data I simply allocated about 100,000 17 byte arrays, then set one byte in each to a non-zero value. This structure made it easy to test the effect of memory alignment on performance, by using either the first 16 or the last 16 bytes of each 17 byte array as the value under test. The total size was significant as well: smaller than the L1 cache on a typical modern CPU, so we avoid measuring memory bandwidth performance.

I tested the following methods for determining whether a hash is all zero, in both aligned and unaligned varieties:

  1. Naive loop over bytes: the most obvious approach simply loops over the bytes, testing each in turn.
  2. Unrolled loop over bytes: loop unrolling is a common optimization for loops with a fixed number of iterations.
  3. Bitwise OR of bytes: OR all bytes together, then compare the result to zero.
  4. Slice by four: treat the 16 bytes as an array of four 32-bit integers, testing each for equality with zero.
  5. Slice by eight: treat the 16 bytes as an array of two 64-bit integers, testing each for equality with zero.
  6. Find first set: use the GCC builtin function __builtin_ffs, which finds the first non-zero bit in a 32-bit integer.
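
To make the strategies concrete, here is a sketch of four of them in Python. The real benchmark is C code, so this only illustrates the logic of each check and says nothing about relative speed. The function names are mine, not taken from the benchmark source:

```python
import struct

def is_zero_naive(digest: bytes) -> bool:
    """Method 1: test each of the 16 bytes in turn."""
    for b in digest:
        if b != 0:
            return False
    return True

def is_zero_or(digest: bytes) -> bool:
    """Method 3: OR all bytes together, then compare the result to zero."""
    acc = 0
    for b in digest:
        acc |= b
    return acc == 0

def is_zero_slice4(digest: bytes) -> bool:
    """Method 4: view the 16 bytes as four 32-bit integers."""
    return all(word == 0 for word in struct.unpack("<4I", digest))

def is_zero_slice8(digest: bytes) -> bool:
    """Method 5: view the 16 bytes as two 64-bit integers."""
    low, high = struct.unpack("<2Q", digest)
    return low == 0 and high == 0
```

All four functions should agree on any 16-byte input; the only difference is how many comparisons each one needs.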

The code was compiled to 64-bit binaries using GCC 4.7.2 with -O2 optimization and no debugging symbols.

The results

I ran the benchmarks on two systems. First I tried a quad-core hyperthreaded 1.73GHz Intel Core i7 laptop with 16GB RAM and an SSD hard drive. All cores were put into performance mode to ensure no CPU frequency scaling was enabled, using (for example) echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor. Here are the results (longer is better):

Comparison of strategies for finding zero-valued MD5 hashes (Intel)

The results are a bit erratic, to be sure — for example, it makes no sense that the unaligned version of the unrolled naive loop should be faster than the aligned version. This is likely just because the operation being measured is so fast that it’s hard to get a “pure” measurement: even tiny fluctuations in system load perturb the tests. You’ll see that if you run the benchmark a few times, you’ll get slightly different results each time. That just means that we shouldn’t make any hard-and-fast decisions based on the exact numbers here.

Nevertheless, the difference between slice-by-four or slice-by-eight and the other strategies is substantial enough that I trust the overall result, if not the exact numbers. That is, slice-by-four and slice-by-eight are clearly significantly faster than any other approach. But — and here’s where we discover the degree of yak shaving we’ve been up to — even the slowest strategy is still pretty damn fast. In all honesty, it is not going to make a lick of difference in overall application performance, unless you really do need to do billions of these checks. A realistic upper bound for my application is maybe ten million, which would consume a tiny fraction of a second using even the naive loop.

One final surprise in this data is the difference in performance between aligned and unaligned memory access, or rather the lack thereof. Conventional wisdom is that you pay a performance penalty for accessing unaligned memory, at least when you try to treat it as 32- or 64-bit blocks. In fact, this result supports other tests which indicate that on the Intel Core i7 there is effectively no penalty for unaligned memory access.

If you’re only working on x86 architectures, you may consider this exercise concluded. But we actually run our software on SPARC architecture as well, so before committing to an implementation let’s take a look at how the benchmark behaves there. This time I used a 1GHz SPARCv9 CPU:

Comparison of strategies for finding zero-valued MD5 hashes (SPARC)

Slice-by-four and slice-by-eight are fastest here too, as long as the data is aligned. If not — BOOM! The application crashes with a bus error, because the SPARC architecture is actually quite sensitive to data alignment. If you want to treat a piece of memory as an integer, it had better be properly aligned.

Conclusion

Informed by these results, we opted to use the slice-by-four strategy. That required a modification of our code, which previously did not guarantee alignment of MD5 hashes. Fortunately that modification was trivial, so it cost us little time and did not make the code any less clear. But you can see hints of the real danger of microbenchmarks: it’s often difficult for a developer to ignore the existence of a faster-but-more-complex strategy, despite evidence that the simple implementation is more than adequately performant. In this case the cost of enabling the faster implementation was negligible, but I’ve seen developers (including myself) needlessly contort code in the name of performance, doggedly defending their choices with microbenchmarks like these. Don’t let yourself become another statistic: use microbenchmarks, sure, but always evaluate the results in the larger context of overall application performance.

With that, I invite you to sound off in the comments: what did I overlook in my microbenchmark? How were the tests flawed? What other strategies do you know for testing whether a series of 16 bytes contains any non-zero bits?

6 tips for writing robust, maintainable unit tests

Unit testing is one of the cornerstones of modern software development, but there’s a surprising lack of advice about how to write good unit tests. That’s a shame, because bad unit tests are worse than no unit tests at all. Over the past decade at Electric Cloud, I’ve written thousands of tests — a full build across all platforms runs over 100,000 tests. In this post I’ll share some tips for writing robust, maintainable tests. I learned these the hard way, but hopefully you can learn from my mistakes.

A key focus here is on eliminating so-called “flaky tests”: those that work almost all the time, but fail once-in-a-great-while for reasons unrelated to the code under test. Such unreliable tests erode confidence in the test suite and even in the value of unit testing itself. In the worst case, a history of failures due to flaky tests can cause people to ignore sporadic-but-genuine test failures, allowing rarely seen but legitimate bugs into the wild.

If that’s not enough to convince you to eliminate flaky tests, consider this: suppose each of your flaky tests fails just one time out of every 10,000 runs, and that you have a thousand such tests overall. At that rate, about 10% of your CI builds will fail due to flaky tests.
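
That 10% figure is easy to verify. The sketch below assumes the flaky tests fail independently of one another, which is a simplification; the function name is just for illustration:

```python
def build_failure_rate(per_test_failure, num_tests):
    """Probability that at least one of `num_tests` independent tests fails."""
    return 1 - (1 - per_test_failure) ** num_tests

rate = build_failure_rate(1 / 10_000, 1_000)
print(f"{rate:.1%}")  # 9.5%
```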

Now that I’ve got your attention, let’s see how to write better tests. Got some tips of your own? Add them in the comments!

1. Never use hardcoded network port numbers

The software I write involves network communication, which means that many of the tests create network sockets. In pseudo-code, a test for the server component looks something like this:

  1. Clean up the previous server instance (if any).
  2. Instantiate the server on port 6003.
  3. Connect to the server via port 6003.
  4. Send some message to the server via the socket.
  5. Assert that the response received is correct.

At first glance this seems pretty safe: no standard service uses port number 6003, so there won’t be any contention with third parties for that port, and by front-loading the cleanup of the server we ensure that there won’t be any contention with our other tests for the port. And, of course, it (seems to) work! In fact, it probably works almost every time you run it. But rest assured: one day, seemingly at random, this test will fail.

For us the failures started after literally years and tens of thousands of executions. We never saw the same test fail twice, but the failure mode was always the same: “Only one usage of each socket address (protocol/network address/port) is normally permitted.” I wish I could say with certainty why the failures started happening — I suspect that our anti-virus software transiently grabs a dynamic port, and sometimes it just happens to grab the port that we planned to use.

Fortunately the solution is simple: instead of using a hardcoded port number, use a dynamic port assigned by the operating system. Usually that means binding to port number zero. This is slightly less convenient than a hardcoded port number, because you’ll have to query the socket after it’s bound to determine the actual port number, but that’s a small price to pay to be confident you’ll never have a spurious test failure due to this mistake.
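
Here is roughly what that looks like using Python’s standard socket module. This is a minimal sketch rather than our actual test code; the loopback address and the "ping" payload are placeholders:

```python
import socket

# Bind to port 0 so the operating system picks a free port for us.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

# Query the bound socket to learn which port was actually assigned.
host, port = server.getsockname()

# The test client connects to whatever port the OS handed out.
client = socket.create_connection((host, port))
conn, _ = server.accept()
client.sendall(b"ping")
received = conn.recv(4)

for s in (client, conn, server):
    s.close()
```

The only extra work compared to a hardcoded port is the getsockname() call and passing its result to the client side of the test.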

2. Never use “sleep()” to synchronize test threads.

A common mistake when writing tests that involve multiple threads is to attempt to use sleep() to synchronize events in different threads. For example, you may have a server thread which opens a socket and waits for a connection. You want to test the behavior of the server thread, but you have to be sure that the socket is opened and ready to accept connections before you try to connect to it. A quick-and-dirty approach might look like this:

  1. Start server thread.
  2. Sleep a bit to give the thread time to get started.
  3. Open a connection to the server port.

There are some problems with this strategy. If you set the delay too short, occasionally the test will fail because the server socket won’t be ready — when you try to open the client-side connection you’ll get a connection refused error. Conversely, you can make the test fairly reliable by setting the delay very long — multiple seconds — but then the test will take at least that long to run, every time you run it.

Instead you should use a condition variable to synchronize the two threads. If that is not practical or not possible, a retry loop is a decent alternative. In that case, make sure that your loop doesn’t retry so frantically that it starves the other thread of CPU cycles to make progress. I like to use an exponential backoff, so initially I get a few retries at very short intervals, then the intervals become progressively longer to give the other thread more time to work:

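import time

delay = 20  # milliseconds to wait before the next retry
total = 0   # milliseconds spent waiting so far
while total < 5000:
    # break if socket can be opened, else sleep/retry
    time.sleep(delay / 1000.0)
    total += delay
    delay += delay  # double the interval each time around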
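
For comparison, here is a sketch of the preferred approach, where the server thread signals when it is ready. It uses Python’s threading.Event, which is built on a condition variable internally. The thread function, the shared dictionary and the "hello" payload are all made up for the example:

```python
import socket
import threading

ready = threading.Event()
info = {}  # lets the server thread report its port to the test

def server_thread():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    info["port"] = srv.getsockname()[1]
    ready.set()  # signal: it is now safe to connect
    conn, _ = srv.accept()
    conn.sendall(b"hello")
    conn.close()
    srv.close()

t = threading.Thread(target=server_thread)
t.start()

# Block until the server says it is listening; no sleep() guesswork.
if not ready.wait(timeout=5):
    raise RuntimeError("server thread never became ready")

client = socket.create_connection(("127.0.0.1", info["port"]))
reply = client.recv(5)
client.close()
t.join()
```

The timeout on wait() is only a safety net so that a broken server fails the test instead of hanging it; in the normal case the test proceeds the instant the server is ready.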

3. Never rely on timing data to verify behavior.

Another mistake is using the observed duration of an operation to assert that a particular behavior has been implemented. For example, to test that a socket read operation times out if no data is received within a set period of time, you might write test code like this:

  1. Set socket timeout to 200ms
  2. Mark start time
  3. Attempt to read from socket
  4. Mark end time
  5. Assert that end minus start is between 200ms and 250ms

The problem is that your test runner might get put to sleep between step 2 and 3, and again between step 3 and step 4 — there’s just no way to control what else might be happening on the computer executing the tests. That means that the delta between the times could be significantly higher than the 200ms you expect. Depending on your hardware and timers, it could even be less than 200ms, despite your code being implemented correctly!

There are at least two robust alternatives to implement this test. The first is to add a counter to the code under test, to be incremented whenever the read operation times out. In the test you could check the value before and after the attempted read, and reasonably expect the value to have increased by exactly one. The second approach is to forgo the explicit validation altogether: if the test completes, then the timeout must be operating correctly. If the test hangs, then the timeout must not be operating.
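
A sketch of the counter approach in Python follows. The Reader class is a hypothetical stand-in for the code under test, not anything from our product; the point is only that the assertion looks at the counter, never at the clock:

```python
import socket

class Reader:
    """Hypothetical unit under test: counts how often a read times out."""

    def __init__(self, sock, timeout):
        self.sock = sock
        self.sock.settimeout(timeout)
        self.timeouts = 0

    def read(self, n):
        try:
            return self.sock.recv(n)
        except socket.timeout:
            self.timeouts += 1
            return None

# A connected pair of sockets where the peer never sends anything.
a, b = socket.socketpair()
reader = Reader(a, timeout=0.2)

before = reader.timeouts
result = reader.read(16)
after = reader.timeouts

a.close()
b.close()
```

A test would then assert that result is None and that after equals before plus one, regardless of how long the read really took.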

4. Never rely on implicit file timestamps.

If your software is sensitive to file timestamps, such as for determining whether file X is newer than file Y, then avoid trusting implicit timestamps in your tests. For example, you might write test code like this:

  1. Create “old” file Y.
  2. Create “new” file X.
  3. Verify that the unit under test behaves correctly given that X is newer than Y.

Superficially this looks sound: X is created after Y, so it should have a timestamp later than Y. Unless, of course, it doesn’t, which can happen for a variety of odd reasons. For example, the system clock might get adjusted underneath your feet, due to NTP or another time-synchronization system. I’ve even seen cases where the file creations happen so quickly that the two files effectively have identical timestamps!

The fix is simple: explicitly set the timestamps on the files to ensure that the relationship is as expected. Generally you should specify a difference of at least 2 full seconds, to accommodate filesystems that have very low resolution timestamps. You should also avoid relying on subsecond timestamp resolution, again to accommodate filesystems that don’t support very high-resolution timestamps.
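
In Python, for example, os.utime sets the timestamps explicitly. This sketch uses a temporary directory and the file names X and Y from the steps above; everything else is illustrative:

```python
import os
import tempfile
import time

with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "Y")
    new_path = os.path.join(tmp, "X")
    for path in (old_path, new_path):
        with open(path, "w") as f:
            f.write("data")

    # Use whole seconds only, and keep the files 2 full seconds apart.
    now = int(time.time())
    os.utime(old_path, (now - 2, now - 2))  # (access time, modification time)
    os.utime(new_path, (now, now))

    age_gap = os.stat(new_path).st_mtime - os.stat(old_path).st_mtime
    # ... run the unit under test against X and Y here ...
```

Because both timestamps are set by the test, the order in which the files were created no longer matters.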

5. Include diagnostic information in test assertions.

You can save yourself a lot of trouble by arranging your test assertions so that they provide detailed information about any failures, rather than simply telling you yes or no, the assertion passed. For example, here’s an unhelpful assertion:

CPPUNIT_ASSERT(errors.empty());

If the errors variable contains error text, the assertion will fail — but you’ll have no idea what the errors were, and thus you’ll have no idea what went wrong in the test. In contrast, here’s an informative assertion:

CPPUNIT_ASSERT_EQUAL(string(""), errors);

Now if the assertion fails the test harness will show you the value of errors, so you’ll have some useful information to start your debugging.

6. Include an explanation of the test in the comments.

Finally, don’t forget to put comments in your test code! Explain how the test works, and why you believe the test actually exercises the feature that you think it tests. It may seem a bit tedious, but remember that the rest of your team may not be as familiar with the code as you are, and they may not know what steps are needed to elicit a particular response from the code. For that matter, in a few months or years you may not remember how to test what you’re trying to test. Such comments are invaluable when updating tests after a refactoring, to understand how the test should be adjusted, as well as when debugging — to understand why a test failed to expose a defect. Here’s an example:

# Test methodology: create a Foo object, then try to set
# the Froznitz attribute to 5. This should produce an error
# because 5 is not a valid Froznitz value. See that the right
# type of exception is thrown, and that the error text is
# correct.

Summary

Everybody knows it’s important to write unit tests. Following the suggestions here will help make sure that your tests are reliable and maintainable. If you have tips of your own, add them in the comments!

The ElectricAccelerator 6.2 “Ship It!” Award

Obviously with ElectricAccelerator 6.2 out the door, it’s time for a new “Ship It!” award. I picked the mechanic figure for this release because the main thrust of the release was to add some long-desired robustness improvements. Here’s the trading card that accompanied the figure:

Greased Lighting — it’s electrifyin’!

Loaded with metrics and analysis goodness!

As promised this iteration of the award includes some metrics comparing this release to previous releases:

  • Number of days in development. We spent 112 days working on the 6.2 release; the range of all feature releases is 80 days on the low end to 249 days at the high end.
  • JIRA issues closed. We closed out 92 issues with this release, including both defects and enhancement requests. The fewest we’ve done was 9 issues; the most was 740 issues.
  • Composition of issues. Of the 92 issues, about 55% were classified as “defects”, and the remaining 45% were “features” of varying magnitude.

Again, a big “Thank You!” goes out to the ElectricAccelerator team! I’m really excited to be working with such a talented group, and I can’t wait to show the world what we’re doing next!

The ElectricAccelerator 6.1 “Ship It!” Award

Having shipped ElectricAccelerator 6.1, I thought you might like to see the LEGO-based “Ship It!” award that I gave each member of the development team. I started this tradition with the 6.0 release last fall. Here’s the baseball card that accompanied the detective minifig I chose for this release:

The great detective is on the case!

The Accelerator 6.1 team

I picked the detective minifig for the 6.1 release in recognition of the significant improvements to Accelerator’s diagnostic capabilities (like cyclic redundancy checks to detect faulty networks, and MD5 checksums to detect faulty disks). Compared to the 6.0 award not much has changed in the design, although I did get my hands on the “official” corporate font this time. It strikes me that there’s a lot of wasted space on the back of the card though. Next time I’ll make better use of the space by incorporating statistics about the release. I actually have the design all ready to go, but you’ll have to wait until after the release to see it. Don’t fret though, the 6.2 release is expected soon!

Makefile hacks: automatically split long command lines

If you’ve worked on a large build system you’ve probably bumped into this error, or one like it:

gmake: execvp: /bin/sh: Argument list too long

This error means the length of some command-line in your makefile has grown past the system limit, which is typically in the 32 to 256 kilobyte range. It’s surprisingly easy to hit that limit. You start with a small list of object files to be linked together. Over time you add more, and the command-line gets a little longer. Add a few more and it gets longer still. Before you know it you have a monster command-line and your build starts failing.

The solution to this problem is simple: split the long command-line into several shorter command-lines. For example, ar r libraries/lib.a objects/foo.o objects/bar.o objects/baz.o objects/boo.o objects/bang.o becomes something like this:

ar r libraries/lib.a objects/foo.o objects/bar.o
ar r libraries/lib.a objects/baz.o objects/boo.o
ar r libraries/lib.a objects/bang.o

Simple in theory, but tedious to do by hand. And doing it manually is like putting a ticking time-bomb into your makefile — it’s only a matter of time before your build grows enough that you have to go through this exercise again.

I recently ran across a clever solution that exploits the $(eval) function in GNU make to split long command-lines automatically, eliminating the tedium and the time-bomb. After I show you the solution, I’ll explain it piece-by-piece.

The max_args function

The solution is a user-defined function called max_args that splits long command-lines into equal-length chunks:

define max_args
$(eval _args:=)
$(foreach obj,$3,$(eval _args+=$(obj))$(if $(word $2,$(_args)),$1$(_args)$(EOL)$(eval _args:=)))
$(if $(_args),$1$(_args))
endef
define EOL


endef

And an example of its use:

OBJS:=a b c d e f g h
all:
	@$(call max_args,echo,2,$(OBJS))

The max_args function takes three parameters: the base command-line, the number of arguments per “chunk”, and the complete list of arguments. It expands to a series of command-lines — one for each chunk of arguments.

The trick behind max_args is the use of $(eval) to update a variable as a side-effect of gmake’s regular variable expansion activity. If you’re not familiar with gmake variable expansion, here’s a quick rundown: when gmake finds a variable or function reference, like $(something), it replaces the entire reference with an expanded value. In the case of a variable that’s just the value of the variable. Most variables in gmake are recursive which means that if the variable value itself contains embedded variable references, those will be expanded as well, recursively. In the case of a function, gmake evaluates the function, and replaces the reference with the computed value.

The meat of max_args is on line 3. It starts with the $(foreach) function, which evaluates its third argument, the body of the loop, once for each word in its second argument — in this case, the list of objects passed in the call to max_args.

In max_args, the loop body has two components. The first is a call to $(eval), which simply appends the current value of the loop variable to an accumulator called _args.

The second component of the loop body uses $(if) and $(word) to check the length of _args. The $(word) function returns the nth word from a list, or an empty string if there are fewer than n words in the list. The $(if) function expands its second argument (the then clause) only if its first argument (the condition) expands to a non-empty string, so together these functions check if _args has the desired number of words, and if so the then clause of the $(if) is expanded.

The then clause of this $(if) has two components. The first constructs a completed command-line by concatenating the base command-line, here given by $1, the first argument to the original max_args call; the accumulated arguments; and a newline character. Thanks to the rules of gmake expansion, this command-line is added to the overall expansion result for the max_args function. The second part of the then clause uses $(eval) to reset the accumulator.

If the chunk size does not evenly divide the number of arguments, the stragglers are emitted in a final command-line on the last line of max_args.
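
If make syntax isn’t your thing, the same accumulate-and-flush logic can be modeled in a few lines of Python. This is only a model of what the expansion produces, not a replacement for the makefile function, and the Python names are mine:

```python
def max_args(command, chunk_size, args):
    """Return one command line per chunk of at most `chunk_size` arguments."""
    lines = []
    pending = []  # plays the role of the _args accumulator
    for arg in args:
        pending.append(arg)
        if len(pending) == chunk_size:
            # The chunk is full: emit a command line and reset the accumulator.
            lines.append(" ".join([command] + pending))
            pending = []
    if pending:
        # Stragglers left over when chunk_size does not divide the list evenly.
        lines.append(" ".join([command] + pending))
    return lines

for line in max_args("echo", 2, "a b c d e f g h".split()):
    print(line)
```

The main difference is that the Python version builds a list explicitly, while the gmake version relies on the text emitted by each loop iteration being concatenated into the overall expansion result.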

Limitations

max_args is handy but it has one significant limitation: command-line length limits are based on the number of bytes in the command-line, not the number of words in it. Unfortunately, gmake has no built-in way to count the number of characters in a string. gmake does provide word-oriented built-ins like $(word) and $(words), so max_args counts words instead. That just means that to use it effectively you have to take a guess at the number of arguments that will fit in a single command-line, for example by dividing the length limit by the average number of characters in each argument, then subtracting some to allow some buffer for outliers.
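
One way to make that guess, sketched in Python. The 32 KB limit, the 80% safety factor (my stand-in for “subtracting some”) and the function name are all assumptions for the example; check the actual limit on your own system:

```python
def estimate_chunk_size(limit_bytes, args, base_command, safety=0.8):
    """Guess how many arguments fit on one command line.

    Divides the space left after the base command by the average
    argument length (plus one separator each), then keeps only the
    fraction `safety` of that as a buffer for unusually long arguments.
    """
    average = sum(len(a) + 1 for a in args) / len(args)
    available = limit_bytes - len(base_command)
    return max(1, int(available / average * safety))

objects = ["objects/foo.o", "objects/bar.o", "objects/baz.o"]
chunk = estimate_chunk_size(32 * 1024, objects, "ar r libraries/lib.a")
print(chunk)
```

The result would then be passed as the second argument to max_args. If your argument lengths vary wildly, lower the safety factor rather than trusting the average.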