What’s new in CloudBees Build Acceleration 12.0?

Just in time for the new year, this month we released CloudBees Build Acceleration 12.0, the 40th feature release of the product previously known as CloudBees Accelerator and before that ElectricAccelerator. This is possibly the most significant update for most end users since the 8.0 release in 2015, thanks to a massive overhaul and expansion of the Build Details page in the Cluster Manager that puts metrics, visualization and even recommendations just a click away in your browser. We also improved jobcache by adding content-sensitive hashing for C/C++ source files — no more cache misses when you change comments! — as well as content-sensitive hashing for Unix archives, and support for caching Kotlin compilation. Finally, we added an enhancement to our GNU make emulation to automatically create output directories: no more need for messy order-only prereqs or sentinel files or mkdir -p $(dirname $@) all over your makefiles. Keep reading for screenshots and more details.

Build Details

The improvements I’m most excited about in Build Acceleration 12.0 are the sweeping updates to the Build Details page in the Cluster Manager. These are designed to give you access to build visualization and performance analysis, right from the comfort of the browser. Much of this functionality has been available for a long time as part of Insight, a desktop application for build visualization and analysis, but we found that few users took advantage of that functionality. We hope that by automatically collecting the data and providing it via the Cluster Manager web interface, more users will be able to leverage that analysis not only to see the benefit they derive from using Build Acceleration, but to better monitor and improve performance. The update consists of a redesigned UI framework for the Build Details page, as well as several new or enhanced sub-tabs:

  • The Settings tab shows both the user-specified options used in the build and any other properties determined by emake itself, such as the OS version.
  • The Environment tab shows the environment variables in effect when emake was invoked.
  • The Performance tab shows dozens of individual performance metrics, such as network and disk bandwidth and compression performance, as well as the number of agents in use over the duration of the build and the critical path through the build: that is, the serialized set of jobs that determines the minimum possible duration of the build.
  • The Jobcache tab shows metrics relating to the use of jobcache in the build, including the overall cache hit rate, the estimated time saved due to caching, and the specific types of jobcache used. You can also find the portion of the total build workload that was cached.
  • The Composition tab shows a breakdown of the work performed during the build according to the semantic classification of that work, such as compilation, linking, or packaging. Clicking on any of the categories shows the longest jobs in the build belonging to that category.
  • The Timeline tab shows a visualization of the build’s execution. For efficiency reasons (both in terms of rendering and backend storage) detailed information is only available for non-trivial jobs in the build. Shorter jobs are aggregated into blocks so they can still be seen in this visualization. If you want to see more details than are available in this visualization, you can run CloudBees Build Acceleration Insight on the annotation file from the build.
  • The Diagnostics tab presents warnings and error messages culled from the build output log, using analysis similar to that found in CloudBees CD (formerly ElectricFlow).
  • Finally, the Recommendations tab presents suggestions for ways to improve build performance and an estimate of the impact of implementing those suggestions. The list is prioritized according to that estimate. Of course this is not an exhaustive list of ways to improve performance — instead, you should see these recommendations as a starting point for build optimization. Today the report checks for several common types of performance gotchas; in the future we hope to add more.

If you’ve used Build Acceleration Insight in the past, some of the information you’ll find in the Build Details page now will seem familiar. The nice thing is that you no longer have to manually remember to run Insight, and you can access this analysis from any browser that can access the Cluster Manager, even for builds that were run on different hosts and for which the annotation file may not be available. I truly believe that having this information easily accessible will enable more users to “self serve” when it comes to performance analysis, effectively making everybody a Build Acceleration “super user”.

Note: in order to use the new Build Details, you must upgrade both the Cluster Manager and emake to 12.0 or later. Enhanced analysis is not available for builds run using older versions of emake.

Build Signature and Totality

Tucked into the Build Details screenshots above you may have noticed a couple additional fields in the header: signature and totality. These new build properties make it possible to identify builds that are building the same stuff, and whether a build is full or incremental:

The signature is simply a hash of the names of all the output targets in the build, in serial order. If you run the same build repeatedly, you should get the same signature for each run. If you add or remove modules or targets in the build, the signature will change. And builds of entirely different things, such as different packages, will have entirely different signatures.

The totality of the build reflects the percentage of rule jobs in the build that were determined to be out-of-date during the run. In theory a full-from-scratch build will have a totality of 100%, although many builds have duplicate targets and other quirks that push the totality below that even for a full-from-scratch build. Conversely, a “no touch” build should have a totality of 0%, but many builds have rules that are always run, so in practice the totality never quite reaches 0% either. Each build is unique, so you’ll have to observe this value in your own builds to learn what is normal for your configuration. In general, comparing totality from dissimilar builds (that is, those that have different signatures) is not useful, but you can use it to distinguish full (or mostly full) builds from incremental or no touch builds when the signatures are the same. Remember too that totality is a continuum: if 100% is a full-from-scratch build and 0% is a no touch build, an incremental that rebuilds only a few outputs might have a totality of 10% or 20%, while an incremental that rebuilds many outputs might have a totality of 70% or 80%.

JobCache Enhancements

CloudBees Build Acceleration 12.0 also includes several improvements to the jobcache feature to improve performance, increase cache hit rates and expand the types of work that can be cached. Chief among these is an intelligent hasher for C/C++ source code, which enables emake to ignore comments and blank lines in those files when determining whether an input has changed. That means that these two fragments are considered equivalent:

/* Generated 2020-DEC-15 11:57:32 */
#include "util.h"
static const int DEBUG = 1;

/* Generated 2020-DEC-18 15:12:47 */
#include "util.h"
static const int DEBUG = 1;

Previously a change like this would have caused a jobcache miss — technically correct, but unfortunate since obviously the code in question has not actually changed in a meaningful way. With this enhancement, emake correctly recognizes that fact, and you’ll get a jobcache hit (assuming no other changes, of course).
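
To make the idea concrete, here is a rough sketch of how such a hasher might work. This is only an illustration, not emake's implementation: it recognizes just blank lines, whole-line // comments and whole-line /* ... */ comments, ignores trickier cases like multi-line block comments and string literals, and uses FNV-1a in place of the real checksum.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Return non-zero if a line should not affect the hash: blank lines,
 * whole-line // comments, and whole-line block comments. */
static int is_ignorable(const char *line)
{
    size_t n;
    while (*line == ' ' || *line == '\t')
        line++;
    n = strlen(line);
    while (n && (line[n-1] == '\n' || line[n-1] == '\r' ||
                 line[n-1] == ' '  || line[n-1] == '\t'))
        n--;
    if (n == 0)
        return 1;                                   /* blank line */
    if (line[0] == '/' && line[1] == '/')
        return 1;                                   /* whole-line // comment */
    if (n >= 4 && line[0] == '/' && line[1] == '*' &&
        line[n-2] == '*' && line[n-1] == '/')
        return 1;                                   /* whole-line block comment */
    return 0;
}

/* Hash a C/C++ source file, skipping lines that carry no code. */
uint64_t hash_c_source(FILE *f)
{
    uint64_t h = 14695981039346656037ULL;           /* FNV-1a offset basis */
    char line[4096];
    while (fgets(line, sizeof line, f)) {
        if (is_ignorable(line))
            continue;                               /* comments and blank lines don't count */
        for (const char *p = line; *p; p++)
            h = (h ^ (uint64_t)(unsigned char)*p) * 1099511628211ULL;
    }
    return h;
}

Feed the two fragments above through this and you get the same value, because the only lines that differ are ignorable.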

Along the same lines, we added an intelligent hasher for Unix archive files — .a or .la files. In this case, emake now knows to ignore the timestamps embedded in the archive, while still considering the content of the members of the archive. Again this allows for a greater number of cache hits in practical usage.

Next, we extended the javac jobcache type so it applies to Kotlin compilation in addition to Java compilation. Kotlin is a programming language that is used extensively in the Android ecosystem, and which is interoperable with Java — in fact, in most cases Kotlin is compiled to Java bytecode. Expanding the scope of jobcache to include Kotlin enables caching of even more work in Android system builds.

Finally, in this release we changed how emake stores timestamp data in jobcache entries. Previously, if a cached job set an explicit timestamp on a file (something like touch -t 202012010000 foo), that timestamp would be recreated precisely as it was saved in the cache, even if the cache was used days or months after it was originally created. That led to some surprising behaviors, because running a build today might produce build outputs with a timestamp from some time far in the past. In turn that caused unwanted rework in incremental builds, because some timestamps were very old while others (from outputs created by uncached jobs, for example) were current. With this change, emake no longer saves explicit timestamp modifications in the jobcache, so outputs pulled from the cache are always given a timestamp reflecting the time at which the outputs were created in the current build.

Automatically creating output directories

Although there are many other improvements in the 12.0 release, there’s one more in particular that I think warrants a mention, because it directly addresses a problem that many make users contend with: how to efficiently, succinctly, and correctly create parent directories for output targets in the build. This is a topic that has been written about often and which should be familiar to anybody who’s had to maintain a make-based build. Essentially the question is: how can we make sure the directories for outputs in a makefile will be created before the outputs themselves are created? If you fail to create the directories, the build will fail. There are a variety of ways to solve this in GNU make, but truthfully they are all kind of clunky, requiring some combination of redundant mkdir commands (which waste time), sentinel files (which create clutter), or extra prereqs that the user has to remember to add all over the place. Other build tools, like ninja, have a much tidier solution: the build tool itself automatically creates the output directory just before it is needed. This is clean, simple, and efficient — and now, if you use emake, you can get the same behavior for your make-based build by adding --emake-create-output-dirs=1 to your invocation. In this mode, emake automatically creates special jobs to handle making output directories in the most efficient possible way, with no makefile modifications required.

Availability

As you can see, I’m pretty excited about the 12.0 release of CloudBees Build Acceleration. I can’t wait to see how people make use of the new Build Details information to understand and optimize their builds, and I’m always amazed that even after 40 releases we’re still finding ways to make builds faster than ever. I hope you’ll upgrade soon.

CloudBees Build Acceleration 12.0 is available immediately for current users, and new users can download a free trial.

What factors affect build performance?

Recently a customer asked me to help them create a list of factors that affect build performance. They found themselves often tasked with explaining to their developers why one build had worse performance than another, or with finding ways to further improve the performance of a build. This is a very big, very complex question — I think perhaps much more so than they realized at first! In fact I think the question as posed is fundamentally unanswerable: I could never give an exhaustive list of the factors that affect build performance. There are too many, and there are surely some that I myself have yet to see — “unknown unknowns” as they say.

Nevertheless, there is value in making a list, even if incomplete, if for no other reason than to serve as a reference for people trying to understand or improve the performance of their builds. What follows is my attempt at creating that list, roughly in order of importance — but bear in mind that this ordering is somewhat subjective and highly situation-dependent: your mileage may vary, and different builds will have different specific bottlenecks.

Factors that affect build performance

  1. Build size

    Builds can be measured in many ways: number of output targets, number of input files, total lines of code, aggregate bytes of output generated, etc. Generally speaking, the bigger the build, the longer it will take to complete. If your build is long simply because of its size, you may think you have no opportunities for improvement, but that’s not so: parallel builds, caching outputs, componentization and beefier hardware can all help cope with this type of problem.

  2. Execution parallelism

    Most build tools support some form of parallel execution — GNU make’s -j option is the classic example. Assuming there is parallelizable work in the build, the build that runs on more cores (that is, with a higher -j value) will complete more quickly.

  3. Available parallelism / build structure

    Running a build on many cores only helps if there is exploitable parallelism in the build. If the build is defined in such a way that parallelism is limited, then it will take longer to complete regardless of the -j value. For example, some builds may have unnecessary serializations in the dependency graph, which will limit performance.

  4. Caching

    The use of (or failure to use) caching technology, such as ElectricAccelerator’s JobCache, ClearCase winkins, or ccache, can dramatically impact build performance. In my tests caching such as JobCache can reduce build duration by 50% or more for full builds.

  5. Conflicts (ElectricAccelerator only)

    For builds executed with ElectricAccelerator, conflicts can have a significant impact on performance. Briefly, an Accelerator conflict occurs any time your build “loses” a race between two steps that should have been serialized but weren’t, due to missing dependencies in your makefiles or build files. Accelerator can detect and correct such errors on-the-fly, but that correction comes at a cost. A few conflicts are usually not a problem, but if you have hundreds or thousands, they will make your build slower as Accelerator reruns portions of the build to get the correct results. Usually a scenario like that is a sign that you didn’t have a complete Accelerator history file for your build, so fixing it is as easy as using the history file generated by that build to augment the dependency information in future runs.

  6. Code complexity and structure

    There are many attributes of the code itself which can affect build performance. For example, as a general rule, very long source files will take longer to compile than shorter files. Files containing very long individual functions will take longer to compile than files with only short functions. Heavy use of templates in C++ will cause slower compilation. Careless use of #include statements in C or C++ code will cause slower compilation and can be especially harmful to incremental builds by triggering excessive recompilation.

  7. Implementation language

    Some languages are easier to process and therefore faster to compile. In general, C++ is one of the slowest languages to compile, while languages like Java or Go compile very quickly. Some languages require no compilation at all, so builds of code using such languages can be very, very fast indeed!

  8. Build tool

    There is a staggering array of build tools that you might choose to drive your build: GNU make, ninja, maven, ant, scons, emake, tup and more. Some were designed for high performance on full builds, others for high performance on incremental builds, and still others for ease-of-use, correctness, or other non-performance-related attributes. The choice of build tool will affect the performance of your build, especially if your build is very large.

  9. Compiler

    For compiled languages like C and C++ there are often many different compilers that you could choose from for your build: gcc/g++, clang, icc, WindRiver, Microsoft cl, tcc, etc. These tools themselves have different performance profiles, and the performance may even vary from one version of the compiler to the next.

  10. Compiler options

    For a given compiler, the options you enable may significantly affect compile time. For example, when using gcc, building with -O3 is generally slower than building with -O0. For developer builds, therefore, you might consider disabling optimizations in order to reduce build cycle time. Other options that may influence compile speed include: pre-compiled headers (PCH); dependency generation (-MD, -MMD, -MF, etc.); profiling or coverage analysis (-fprofile-arcs or -ftest-coverage); and include path definitions (-I), which if very long can cause the compiler to spend excessive time searching for header files.

  11. Linker

    As with the compiler, different linkers have different performance characteristics. For C/C++ compilation on Linux the default linker is GNU ld, but there are alternatives like Google gold which have much better performance, albeit for a subset of the use cases supported by GNU ld. If your use case is supported by gold, you will likely see much better build performance by switching.

  12. Memory

    As with any process involving computers, the amount of available memory will have a significant impact on build performance. Too little and your system will swap excessively. Fortunately there’s no such thing as “too much”, though it may be prohibitively expensive to get so much RAM that you can stop worrying about it. In practice most builds do not require a huge amount of memory, but if yours do, and you don’t have enough, your build speed will suffer.

  13. Disk performance

    Like memory, the performance of your disk can significantly affect build performance. In fact, it’s easier in some ways to understand the impact of disk speed. If the build generates 10GB of output and your disk can only write at 10MB/s, the fastest the build can possibly finish is about 1,000 seconds, or about 17 minutes. On the other hand, if the build generates only 5MB of output and uses the same disk, then only half a second is needed to write the build outputs, so the disk is unlikely to be a bottleneck. You may find that the disk is adequate for your builds now, but as the build grows you will reach a point where the disk is no longer fast enough. At that point you can upgrade to a faster disk, which will be sufficient for some time, until your build again grows to exceed the capacity of the disk.

    Even if your disk is not a primary bottleneck now, switching to a faster disk may improve performance somewhat. Many users have had good results from switching to SSD for temporary storage, or using striped RAID for those builds that generate truly enormous amounts of data.

  14. Network performance

    For distributed builds such as those executed with ElectricAccelerator, network performance is crucial because build data has to be transferred across the network. But even if the build itself is not distributed, it may make use of tools pulled from a network file share, so the network performance can affect the build.

  15. Operating system / kernel version

    Some operating systems have better performance for builds than others — in general, I’ve found builds on Linux to be relatively faster than builds on Windows, for example. Likewise, some versions of the operating system may be faster than others. Some users have reported as much as a 3x improvement by upgrading from an old version of Linux to a newer version due to optimizations in the kernel itself.

  16. Anti-virus software

    Use of anti-virus software can dramatically impact build performance, particularly if the A/V is configured in one of the more aggressive or intrusive modes of operation: sometimes every file operation is intercepted by the A/V scanner, adding a substantial drag on build speed.

  17. License management

    Some build tools, such as certain commercial compilers, require licenses in order to operate. If the license system is misconfigured it can add delays to the build process, sometimes causing each compile to take minutes instead of seconds as the compiler tries and fails to contact a license server, or contacts the wrong license server instance (for example, one on a different subnet).

A foundation for performance investigations

So there you have it: my (not entirely) comprehensive list of factors that can affect build performance. Of course these won’t all be relevant for every build: every build is different, and each has a unique performance profile. A slow disk may be mostly irrelevant for one build but absolutely critical for another. My hope is that this list will serve as a foundation for your build performance investigations — something to help get you started, even if it doesn’t get you all the way to a conclusion.

What do you think of my list? What would you add, and how would you change the ordering? Let me know in the comments below.

What’s new in ElectricAccelerator 7.2?

Wow, time flies! Another six months has come and gone, which means it’s time for another ElectricAccelerator feature release. Right on cue, ElectricAccelerator 7.2 dropped a couple weeks back on April 17, 2014. There’s no unifying theme to this release — actually we’re in the middle of a much more ambitious project that I can’t say much about quite yet, but over the last several months we’ve made a number of improvements to Accelerator core functionality, and we’re eager to get those out to users. Thus we have the 7.2 release, with the following marquee features: dramatic Linux performance improvements for certain use cases, a key enhancement to our parse avoidance feature to improve accuracy, and expanded Linux platform support. Read on for the details.

Linux performance improvements

Accelerator 7.2 incorporates two performance improvements for Linux-based builds. The first is a redesign of the integration between the Electric File System (EFS) and the Linux kernel, which reduces lock contention in the EFS. Consequently, any build job that makes concurrent accesses to the filesystem should see some performance improvement. In one example, a build that executed two tar processes simultaneously in one job saw runtime drop from 11 minutes with Accelerator 7.1 to just 6 minutes with Accelerator 7.2, nearly 2x faster!

The second improvement is full support for the Linux d_type extension to the readdir() system call. On most Unix and Unix-like systems, the readdir() system call gives the application programmer only a couple pieces of information: the names of the files in a directory, and the inode number for those files. On Linux, filesystems may also include file type information in the results, which enables programs to operate more efficiently in some cases because they can avoid the overhead of an additional stat() call to get the file type. On a local filesystem that optimization is interesting but not necessarily game-changing; but with a distributed network filesystem like the EFS, that optimization can result in enormous improvements. In our benchmarks we saw jobs using find to scan large directory structures execute nearly 9x faster with Accelerator 7.2 versus Accelerator 7.1.
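
To see what that optimization looks like from the application's side, here is a small, generic Linux example (my illustration, not Accelerator code) that classifies directory entries with d_type and only needs a fallback when the filesystem reports DT_UNKNOWN:

#define _DEFAULT_SOURCE        /* for the DT_* constants with glibc */
#include <dirent.h>
#include <stdio.h>

void list_subdirs(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *ent;

    if (d == NULL)
        return;
    while ((ent = readdir(d)) != NULL) {
        if (ent->d_type == DT_DIR)
            printf("directory: %s\n", ent->d_name);        /* type came straight from readdir() */
        else if (ent->d_type == DT_UNKNOWN)
            printf("need stat() for: %s\n", ent->d_name);  /* filesystem didn't fill in d_type */
    }
    closedir(d);
}

Tools like find do essentially this, which is why filesystems that fill in d_type let them skip an enormous number of stat() calls.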

Parse avoidance update

For large builds with many or complicated makefiles, Accelerator’s parse avoidance feature is a game-changer, dramatically reducing the time necessary to read and interpret makefiles at the start of a build. On the Android KitKat open-source build, parse avoidance reduces a 4 minute parse job to about 5 seconds — nearly 50x faster! Since its introduction in Accelerator 7.0, parse avoidance has delivered jaw-dropping improvements like that in a wide variety of builds.

But use of this feature has been problematic in one specific use case: makefiles that use wildcards in prerequisite lists, with either $(wildcard) or $(shell). In certain circumstances this makefile anti-pattern could cause emake to produce “false positives” from the parse avoidance cache, such that emake would incorrectly use a previously cached parse result when it should have instead reparsed the makefile. In Accelerator 7.2 we’ve extended the #pragma cache syntax so that you can inform emake of the wildcard patterns to consider when determining cache suitability. This will enable even more users to enjoy the benefits of parse avoidance, without sacrificing reliability or performance. Usage instructions can be found in the Electric Make User’s Guide.

New platform support

Finally, with Accelerator 7.2 we’ve further expanded our already sweeping platform support to include RedHat Enterprise Linux 6.5, Ubuntu 13.04 and Windows Server 2012. This may seem like a modest increment, but I’m particularly excited about this update not for the what but for the who: you see, this is the first time that somebody other than myself made all the updates needed to support a new version of Linux, start to finish. With another set of hands to do that work we should be able to add support for new Linux platforms much more quickly in the future, which is welcome news indeed (thanks, Tim)!

What’s next?

Years ago, I thought that we would eventually get to the point that Accelerator was “done” and we’d have nothing left to do. How young and foolish I was! In reality, it seems that the TODO list only gets longer and longer. We’re still working hard on the “buddy cluster” concept, as well as Bitbake and ninja integration. And of course, we’re always working to improve performance — more on that in a future post.

ElectricAccelerator 7.2 is available immediately for existing customers. Contact support@electric-cloud.com to get the bits. New users can contact sales@electric-cloud.com for an evaluation.

UPDATE: SCons is Still Really Slow

A while back I posted a series of articles exploring the scalability of SCons, a popular Python-based build tool. In a nutshell, my experiments showed that SCons exhibits roughly quadratic growth in build runtimes as the number of targets increases:

Recently Dirk Baechle attempted to rebut my findings in an entry on the SCons wiki: Why SCons is not slow. I thought Dirk made some credible suggestions that could explain my results, and he did some smart things in his effort to invalidate my results. Unfortunately, his methods were flawed and his conclusions are invalid. My original results still stand: SCons really is slow. In the sections that follow I’ll share my own updated benchmarks and show where Dirk’s analysis went wrong.

Test setup

As before, I used genscons.pl to generate sample builds ranging from 2,000 to 50,000 targets. However, my test system was much beefier this time:

        2013                                              2010
OS      Linux Mint 14 (kernel version 3.5.0-17-generic)   RedHat Desktop 3 (kernel version 2.4.21-58.ELsmp)
CPU     Quad 1.7GHz Intel Core i7, hyperthreaded          Dual 2.4GHz Intel Xeon, hyperthreaded
RAM     16 GB                                             2 GB
HD      SSD                                               (unknown)
SCons   2.3.0                                             1.2.0.r3842
Python  2.7.3 (system default)                            2.6.2

Before running the tests, I rebooted the system to ensure there were no rogue processes consuming memory or CPU. I also forced the CPU cores into “performance” mode to ensure that they ran at their full 1.7GHz speed, rather than at the lower 933MHz they switch to when idle.

Revisiting the original benchmark

I think Dirk had two credible theories to explain the results I obtained in my original tests. First, Dirk wondered if those results may have been the result of virtual memory swapping — my original test system had relatively little RAM, and SCons itself uses a lot of memory. It’s plausible that physical memory was exhausted, forcing the OS to swap memory to disk. As Dirk said, “this would explain the increase of build times” — you bet it would! I don’t remember seeing any indication of memory swapping when I ran these tests originally, but to be honest it was nearly 4 years ago and perhaps my memory is not reliable. To eliminate this possibility, I ran the tests on a system with 16 GB RAM this time. During the tests I ran vmstat 5, which collects memory and swap usage information at five second intervals, and captured the result in a log.

Next, he suggested that I skewed the results by directing SCons to inherit the ambient environment, rather than using SCons’ default “sanitized” environment. That is, he felt I should have used env = Environment() rather than env = Environment(ENV = os.environ). To ensure that this was not a factor, I modified the tests so that they did not inherit the environment. At the same time, I substituted echo for the compiler and other commands, in order to make the tests faster. Besides, I’m not interested in benchmarking the compiler — just SCons! Here’s what my Environment declaration looks like now:

env = Environment(CC = 'echo', AR = 'echo', RANLIB = 'echo')

With these changes in place I reran my benchmarks. As expected, there was no change in the outcome. There is no doubt: SCons does not scale linearly. Instead the growth is polynomial, following an n^1.85 curve. And thanks to the vmstat output we can be certain that there was absolutely no swapping affecting the benchmarks. Here’s a graph of the results, including an n^1.85 curve for comparison — notice that you can barely see that curve because it matches the observed data so well!

SCons full build runtime - click for larger view

For comparison, I used the SCons build log to make a shell script that executes the same series of echo commands. At 50,000 targets, the shell script ran in 1.097s. You read that right: 1.097s. Granted, the shell script doesn’t do stuff like up-to-date checks, etc., but still — of the 3,759s average SCons runtime, 3,758s — 99.97% — is SCons overhead.

I also created a non-recursive Makefile that “builds” the same targets with the same echo commands. This is a more realistic comparison to SCons — after all, nobody would dream of actually controlling a build with a straight-line shell script, but lots of people would use GNU make to do it. With 50,000 targets, GNU make ran for 82.469s — more than 45 times faster than SCons.

What is linear scaling?

If the performance problems are so obvious, why did Dirk fail to see them? Here’s a graph made from his test results:

SCons full build runtime, via D. Baechle - click for full size

Dirk says that this demonstrates “SCons’ linear scaling”. I find this statement baffling, because his data clearly shows that SCons does not scale linearly. It’s simple, really: linear scaling just means that the build time increases by the same amount for each new target you add, regardless of how many targets you already have. Put another way, it means that the difference in build time between 1,000 targets and 2,000 targets is exactly the same as the difference between 10,000 and 11,000 targets, or between 30,000 and 31,000 targets. Or, put yet another way, it means that when you plot the build time versus the number of targets, you should get a straight line with no change in slope at any point. Now you tell me: does that describe Dirk’s graph?

Here’s another version of that graph, this time augmented with a couple additional lines that show what the plot would look like if SCons were truly scaling linearly. The first projection is based on the original graph from 2,500 to 4,500 targets — that is, if we assume that SCons scales linearly and that the increase in build time between 2,500 and 4,500 targets is representative of the cost to add 2,000 more targets, then this line shows us how we should expect the build time to increase. Similarly, the second projection is based on the original graph between 4,500 and 8,500 targets. You can easily see that the actual data does not match either projection. Furthermore you can see that the slope of these projections is increasing:

SCons full build runtime with linear projections, via D. Baechle - click for full size

This shows the importance of testing at large scale when you’re trying to characterize the scalability of a system from empirical data. It can be difficult to differentiate polynomial from logarithmic or linear at low scales, especially once you incorporate the constant factors — polynomial algorithms can sometimes even give better absolute performance for small inputs than linear algorithms! It’s not until you plot enough data points at large enough values, as I’ve done, that it becomes easy to see and identify the curve.

What does profiling tell us?

Next, Dirk reran some of his tests under a profiler, on the very reasonable assumption that if there was a performance problem to be found, it would manifest in the profiling data — surely at least one function would demonstrate a larger-than-expected growth in runtime. Dirk only shared profiling data for two runs, both incremental builds, at 8,500 and 16,500 targets. That’s unfortunate for a couple reasons. First, the performance problem is less apparent on incremental builds than on full builds. Second, with only two datapoints it is literally not possible to determine whether growth is linear or polynomial. The results of Dirk’s profiling were negative: he found no “significant difference or increase” in any function.

Fortunately it’s easy to run this experiment myself. Dirk used cProfile, which is built into Python. To profile a Python script you can inject cProfile from the command-line, like this: python -m cProfile scons. Just before Python exits, cProfile dumps timing data for every function invoked during the run. I ran several full builds with the profiler enabled, from 2,000 to 20,000 targets. Then I sorted the profiling data by function internal time (time spent in the function exclusively, not in its descendants). In every run, the same two functions appeared at the top of the list: posix.waitpid and posix.fork. To be honest this was a surprise to me — previously I believed the problem was in SCons’ Taskmaster implementation. But I can’t really argue with the data. It makes sense that SCons would spend most of its time running and waiting for child processes to execute, and even that the amount of time spent in these functions would increase as the number of child processes increases. But look at the growth in runtimes in these two functions:

SCons full build function time, top two functions - click for full size

Like the overall build time, these curves are obviously non-linear. Armed with this knowledge, I went back to Dirk’s profiling data. To my surprise, posix.waitpid and posix.fork don’t even appear in Dirk’s data. On closer inspection, his data seems to include only a subset of all functions — about 600 functions, whereas my profiling data contains more than 1,500. I cannot explain this — perhaps Dirk filtered the results to exclude functions that are part of the Python library, assuming that the problem must be in SCons’ own code rather than in the library on which it is built.

This demonstrates a second fundamental principle of performance analysis: make sure that you consider all the data. Programmers’ intuition about performance problems is notoriously bad — even mine! — which is why it’s important to measure before acting. But measuring won’t help if you’re missing critical data or if you discard part of the data before doing any analysis.

Conclusions

On the surface, performance analysis seems like it should be simple: start a timer, run some code, stop the timer. Done correctly, performance analysis can illuminate the dark corners of your application’s performance. Done incorrectly — and there are many ways to do it incorrectly — it can lead you on a wild goose chase and cause you to squander resources fixing the wrong problems.

Dirk Baechle had good intentions when he set out to analyze SCons performance, but he made some mistakes in his process that led him to an erroneous conclusion. First, he didn’t run enough large-scale tests to really see the performance problem. Second, he filtered his experimental data in a way that obscured the existence of the problem. But perhaps his worst mistake was to start with a conclusion — that there is no performance problem — and then look for data to support it, rather than starting with the data and letting it impartially guide him to an evidence-based conclusion.

To me the evidence seems indisputable: SCons exhibits roughly quadratic growth in runtimes as the number of build targets increases, rendering it unusable for large-scale software development (tens of thousands of build outputs). There is no evidence that this is a result of virtual memory swapping. Profiling suggests a possible pair of culprits in posix.waitpid and posix.fork. I leave it to Dirk and the SCons team to investigate further; in the meantime, you can find my test harness and test results in my GitHub repo. If you can see a flaw in my methodology, sound off in the comments!

What’s new in ElectricAccelerator 7.1

ElectricAccelerator 7.1 hit the streets last month, on October 10, just six months after the 7.0 release in April. There are some really cool new features in this release, which picks up right where 7.0 left off by adding even more ground-breaking performance features: schedule optimization and Javadoc caching. Here’s a quick look at each.

Schedule Optimization

The idea behind schedule optimization is really simple: we can reduce overall build duration if we’re smarter about the order in which jobs are run. In essence, it’s about packing the jobs in tighter, eliminating idle time in the middle of the build and reducing the “ragged right edge”. Here’s a side-by-side comparison of the same build, first using normal scheduling and then using schedule optimization. You can easily see that schedule optimization made the second build faster — an 11% improvement in this small, real-world example:

Build using naive scheduling - click to view full size

Build using schedule optimization - click to view full size

If you study the two runs more closely, you can see how schedule optimization produced this improvement: key jobs, in particular the longest jobs, were started earlier. As a result, idle time in the middle of the build was reduced or eliminated entirely, and the right edge is much less uneven. But the best part? It’s completely automatic: all you have to do is run the build once for emake to learn its performance profile. Every subsequent build will leverage that data to improve build performance, almost like magic.

Not convinced? Here’s a look at the impact of schedule optimization on another, much bigger proprietary build (serial build time 18h25m). The build is already highly parallelizable and achieves an impressive 37.2x speedup with 48 agents — but schedule optimization can reduce the build duration by nearly 25% more, bringing the total speedup on 48 agents to an eye-popping 47.5x!

Build duration with naive and optimized scheduling

There’s another interesting angle to schedule optimization though. Most people will take the performance gains and use them to get a faster build on the same hardware. But you could go the other direction just as easily — keep the same build duration, but do it with dramatically less hardware. The following graph quantifies the savings, in terms of cores needed to achieve a particular build duration. Suppose we set a target build duration of 30 minutes. With naive scheduling, we’d need 48 agents to meet that target. With schedule optimization, we need only 38.

Resource requirements with naive and optimized scheduling - click for full size

I’m really excited about schedule optimization, because it’s one of those rare features that give you something for nothing. It’s also been a long time coming — the idea was originally conceived of over three years ago, and it’s only now that we were able to bring it to fruition.

Schedule optimization works with emake on all supported platforms, with all emulation modes. It is not currently available for use with electrify.

Javadoc caching

The second major feature in Accelerator 7.1 is Javadoc caching. Again, it’s a simple idea: think “ccache”, but for Javadoc instead of compiles. This is the next phase in the evolution of Accelerator’s output reuse initiative, which began in the 7.0 release with parse avoidance. Like any output reuse feature, Javadoc caching works by capturing the product of a Javadoc invocation and storing it in a cache indexed by a hash of the inputs used — including the Java files themselves, the environment variables, and the command-line. In subsequent builds, emake will check those inputs again and if it computes the same hash, emake will use the cached results instead of running Javadoc again. On big Javadoc jobs, this can produce significant savings. For example, in the Android “Jelly Bean” open-source build, the main Javadoc invocation usually takes about five minutes. With Javadoc caching in Accelerator 7.1, that job runs in only about one minute — an 80% reduction! In turn that gives us a full one minute reduction in the overall build time, dropping the build from 13 minutes to 12 — nearly a 10% improvement:

Uncached Javadoc job in Android build - click for full image

Cached Javadoc job in Android build - click for full image

Javadoc caching is available on Solaris and Linux only in Accelerator 7.1.

Looking ahead

I hope you’re as excited about Accelerator 7.1 as I am — for the second time this year, we’re bringing revolutionary new performance features to the table. But of course our work is never done. We’ve been hard at work on the “buddy cluster” concept for the next release of Accelerator. Hopefully I’ll be able to share some screenshots of that here before the end of the year. We’re also exploring acceleration for Bitbake builds like the Yocto Project. And last, but certainly not least, we’ll soon start fleshing out the next phase of output reuse in Accelerator — caching compiler invocations. Stay tuned!

What’s new in ElectricAccelerator 7.0

ElectricAccelerator 7.0 was officially released a couple weeks ago now, on April 12, 2013. This version, our 26th feature release in 11 years, incorporates performance features that are truly nothing less than revolutionary: dependency optimization and parse avoidance. To my knowledge, no other build tool in the world has comparable functionality, is working on comparable functionality or is even capable of adding such functionality. Together these features have enabled us to dramatically cut Android 4.1.1 (Jelly Bean) build times, compared to Accelerator 6.2:

  • Full, from-scratch builds are 35% faster
  • “No touch” incremental builds are an astonishing 89% faster

In fact, even on this highly optimized, parallel-friendly build, Accelerator 7.0 is faster than GNU make, on the same number of cores. On a 48-core system gmake -j 48 builds Android 4.1.1 in 15 minutes. Accelerator 7.0 on the same system? 12 minutes, 21 seconds: 17.5% faster.

Read on for more information about the key new features in ElectricAccelerator 7.0.

Dependency optimization: use only what you need

Dependency optimization is a new application of the data that is used to power Accelerator’s conflict detection and correction features. But where conflict detection is all about finding missing dependencies in makefiles, dependency optimization is focused on finding surplus dependencies, which drag down build performance by needlessly limiting parallelism. Here’s a simple example:

foo: bar
        @echo abc > foo && sleep 10
bar:
        @echo def > bar && sleep 10

In this makefile you can easily see that the dependency between foo and bar is superfluous. Unfortunately GNU make is shackled by the dependencies specified in the makefile and is thus obliged to run the two jobs serially. In contrast, with dependency optimization enabled emake can detect this inefficiency and ignore the unnecessary dependency — so foo and bar will run in parallel.

Obviously you could trivially fix this simple makefile, but in real-world builds that may be difficult or impossible to do manually. For example, in the Android 4.1.1 build, there are about 2 million explicitly specified dependencies in the makefiles. For a typical variant build, only about 300 thousand are really required: over 85% of the dependencies are unnecessary. And that's in the Android build, which is regarded by some as a paragon of parallel-build cleanliness — imagine the opportunities for improvement in builds that don't have Google's resources to devote to the problem.

To enable dependency optimization in your builds, add --emake-optimize-deps=1 to your emake command-line. The first build with that option enabled will "learn" the characteristics of the build; the second and subsequent builds will use that information to improve performance.

Parse avoidance: the fastest job is the one you don't have to do

A common complaint with large build systems is incremental build performance — specifically, the long lag between the time that the user invokes make and the time that make starts the first compile. Some have even gone so far as to invent entirely new build tools with a specific focus on this problem. Parse avoidance delivers similar performance gains without requiring the painful (perhaps impossible!) conversion to a new build tool. For example, a "no touch" incremental build of Android 4.1.1 takes close to 5 minutes with Accelerator 6.2, but only about 30 seconds with Accelerator 7.0.

On complex builds, a large portion of the lag comes from parsing makefiles. The net result of that effort is a dependency graph annotated with targets and the commands needed to generate them. The core idea underpinning parse avoidance is the realization that we need not redo that work on every build. Most of the time, the dependency graph, et al, is unchanged from one build to the next. Why not cache the result of the parse and reuse it in the next build? So that's what we did.

To enable parse avoidance in your builds, add --emake-parse-avoidance=1 to your emake command-line. The first build with that option will generate a parse result to add to the cache; the second and subsequent builds will reload the cached result in lieu of reparsing the makefiles from scratch.

Other goodies

In addition to the marquee features, Accelerator 7.0 includes dozens of other improvements. Here are some of the highlights:

  • Limited GNU make 3.82 support. emake now allows assignment modifiers (like ?=, etc.) on define-style variable definitions, when --emake-emulation=gmake3.82
  • Order-only prerequisites in NMAKE emulation mode. GNU make introduced the concept of order-only prerequisites in 3.80. With this release we've extended our NMAKE emulation with the same concept.
  • Enhancements to electrify. The biggest improvement is the ability to match full command-lines to decide whether or not a particular command should be executed remotely (Linux only). Previously, electrify could only match against the process name.

What's next?

In my opinion, Accelerator 7.0 is the most exciting release we've put out in close to two years, with truly ground-breaking new functionality and performance improvements. It's not often that you can legitimately claim double-digit percentage performance improvements in a mature product. I'm incredibly proud of my team for this accomplishment.

With that said: there's always room to do more. We're already gearing up for the next release. The exact release content is not yet nailed down, but on the short list of candidates is a new job scheduler, to enable still better performance; "buddy cluster" facilities, to allow the use of Accelerator without requiring dedicated hardware; and possibly some form of acceleration for Maven-based builds. Let's go!

What is the fastest way to find non-zero bits in an MD5 hash?

Microbenchmarks, as a general rule, are a waste of time. Let’s just get that out of the way up front. They are also, as a general rule, totally inaccurate, measuring the execution time of some snippet of code in a context that is completely divorced from the reality in which that code will actually be used. So if after reading this article you think, “I should tell Eric what a waste of time this was!” — don’t bother. I already know.

But… microbenchmarks are also fun, and sometimes interesting, and often vastly easier to implement than a real benchmark of the same code in a production system. So a couple weeks ago, when my colleague proposed using an MD5 hash with value zero as a sentinel indicating that the checksum had not yet been calculated, I wondered: what is the fastest way to test whether an MD5 hash has any non-zero bits? I had some time to kill so I wrote a microbenchmark comparing several implementations. The results are presented here for your amusement and edification.

The benchmark

The goal of the benchmark is to determine which of several methods can most quickly determine whether an MD5 hash is all zeroes. An MD5 hash is 128 bits long, so in essence this problem boils down to simply checking for non-zero bits in an arbitrary sequence of 16 bytes. You can find the benchmark source code in my Github repo.

For sample data I simply allocated about 100,000 17-byte arrays, then set one byte in each to a non-zero value. This structure made it easy to test the effect of memory alignment on performance, by using either the first 16 or the last 16 bytes of each 17-byte array as the value under test. The total size was significant as well: small enough to fit in cache on a typical modern CPU, so we avoid measuring memory bandwidth performance.
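
As a rough sketch of that layout (my illustration, not the actual benchmark code, which allocates the arrays a little differently), each sample can be carved out of an aligned buffer so that the same 17-byte window yields both an aligned and an unaligned 16-byte view:

#include <stdlib.h>
#include <string.h>

/* Allocate one sample: a 17-byte window holding exactly one non-zero byte.
 * Bytes 0..15 form the aligned test value; bytes 1..16 form the unaligned one. */
unsigned char *make_sample(void)
{
    unsigned char *buf = aligned_alloc(16, 32);   /* C11; size must be a multiple of the alignment */
    if (buf == NULL)
        return NULL;
    memset(buf, 0, 32);
    buf[rand() % 17] = 0xff;                      /* the single non-zero byte */
    return buf;
}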

I tested the following methods for determining whether a hash is all zero, in both aligned and unaligned varieties:

  1. Naive loop over bytes: the most obvious approach simply loops over the bytes, testing each in turn.
  2. Unrolled loop over bytes: loop unrolling is a common optimization for loops with a fixed number of iterations.
  3. Bitwise OR of bytes: OR all bytes together, then compare the result to zero.
  4. Slice by four: treat the 16 bytes as an array of four 32-bit integers, testing each for equality with zero.
  5. Slice by eight: treat the 16 bytes as an array of two 64-bit integers, testing each for equality with zero.
  6. Find first set: use the GCC builtin function __builtin_ffs, which finds the first non-zero bit in a 32-bit integer.

The code was compiled to 64-bit binaries using GCC 4.7.2 with -O2 optimization and no debugging symbols.
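
For concreteness, here is roughly what the slice-by-eight check looks like. This is my own sketch, not the benchmark source (that lives in the Github repo mentioned above); note that the cast in the first version assumes properly aligned data, a point that matters later:

#include <stdint.h>
#include <string.h>

/* Slice by eight: view the 16 bytes as two 64-bit words and test each.
 * The cast requires the hash to be 8-byte aligned. */
int md5_is_zero_slice8(const unsigned char hash[16])
{
    const uint64_t *words = (const uint64_t *)hash;
    return words[0] == 0 && words[1] == 0;
}

/* A strictly portable variant: copy into local words first.  A good
 * compiler typically reduces this to the same two comparisons. */
int md5_is_zero_portable(const unsigned char hash[16])
{
    uint64_t a, b;
    memcpy(&a, hash, 8);
    memcpy(&b, hash + 8, 8);
    return a == 0 && b == 0;
}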

The results

I ran the benchmarks on two systems. First I tried a quad-core hyperthreaded 1.73GHz Intel Core i7 laptop with 16GB RAM and an SSD hard drive. All cores were put into performance mode to ensure no CPU frequency scaling was enabled, using (for example) echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor. Here are the results (longer is better):

Comparison of strategies for finding zero-valued MD5 hashes (Intel)

The results are a bit erratic, to be sure — for example, it makes no sense that the unaligned version of the unrolled naive loop should be faster than the aligned version. This is likely just because the operation being measured is so fast that it’s hard to get a “pure” measurement: even tiny fluctuations in system load perturb the tests. You’ll see that if you run the benchmark a few times, you’ll get slightly different results each time. That just means that we shouldn’t make any hard-and-fast decisions based on the exact numbers here.

Nevertheless, the difference between slice-by-four or slice-by-eight and the other strategies is substantial enough that I trust the overall result, if not the exact numbers. That is, slice-by-four and slice-by-eight are clearly significantly faster than any other approach. But — and here’s where we discover the degree of yak shaving we’ve been up to — even the slowest strategy is still pretty damn fast. In all honesty, it is not going to make a lick of difference in overall application performance, unless you really do need to do billions of these checks. A realistic upper bound for my application is maybe ten million, which would consume a tiny fraction of a second using even the naive loop.

One final surprise in this data is the difference in performance between aligned and unaligned memory access — or rather, the lack thereof. Conventional wisdom is that you pay a performance penalty for accessing unaligned memory, at least when you try to treat it as 32- or 64-bit blocks. In fact, this result supports other tests which indicate that on the Intel Core i7 there is effectively no penalty for unaligned memory access.

If you’re only working on x86 architectures, you may consider this exercise concluded. But we actually run our software on SPARC architecture as well, so before committing to an implementation let’s take a look at how the benchmark behaves there. This time I used a 1GHz SPARCv9 CPU:

Comparison of strategies for finding zero-valued MD5 hashes (SPARC)

Slice-by-four and slice-by-eight are fastest here too, as long as the data is aligned. If not — BOOM! The application crashes with a bus error, because the SPARC architecture is actually quite sensitive to data alignment. If you want to treat a piece of memory as an integer, it had better be properly aligned.

Conclusion

Informed by these results, we opted to use the slice-by-four strategy. That required a modification of our code, which previously did not guarantee alignment of MD5 hashes. Fortunately that modification was trivial, so it cost us little time and did not make the code any less clear. But you can see hints of the real danger of microbenchmarks: it’s often difficult for a developer to ignore the existence of a faster-but-more-complex strategy, despite evidence that the simple implementation is more than adequately performant. In this case the cost of enabling the faster implementation was negligible, but I’ve seen developers (including myself) needlessly contort code in the name of performance, doggedly defending their choices with microbenchmarks like these. Don’t let yourself become another statistic: use microbenchmarks, sure, but always evaluate the results in the larger context of overall application performance.

With that, I invite you to sound off in the comments: what did I overlook in my microbenchmark? How were the tests flawed? What other strategies do you know for testing whether a series of 16 bytes contains any non-zero bits?

Rapidly detecting Linux kernel source features for driver deployment

A while back I wrote about genconfig.sh, a technique for auto-detecting Linux kernel features, in order to improve portability of an open source, out-of-tree filesystem driver I developed as part of ElectricAccelerator. genconfig.sh has enabled me to maintain compatibility across a wide range of Linux kernels with relative ease, but recently I noticed that the script was unacceptably slow. On the virtual machines in our test lab, genconfig.sh required nearly 65 seconds to execute. For my 11-person team, a conservative estimate of time wasted waiting for genconfig.sh is nearly an entire person-month per year. With a little effort, I was able to reduce execution time to about 7 seconds, nearly 10x faster! Here’s how I did it.

A brief review of genconfig.sh

genconfig.sh is a technique for detecting the source code features of a Linux kernel. Like autoconf configure scripts, genconfig.sh uses a series of trivial C source files to probe for various kernel source features, like the presence or absence of the big kernel lock. For example, here’s the code used to determine whether the kernel has the set_nlink() helper function:

#include <linux/fs.h>
void dummy(struct inode *i)
{
    set_nlink(i, 0);
}

If a particular test file compiles successfully, the feature is present; if the compile fails, the feature is absent. The test results are used to set a series of C preprocessor #define directives, which in turn are used to conditionally compile driver code suitable for the host kernel.
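
The driver code then selects the appropriate variant at compile time. Here is a hypothetical example of how one of those generated defines gets used; the macro and function names are illustrative rather than what genconfig.sh actually emits:

#include <linux/fs.h>

static void efs_set_link_count(struct inode *inode, unsigned int count)
{
#ifdef HAVE_SET_NLINK
    set_nlink(inode, count);      /* kernels that provide set_nlink() */
#else
    inode->i_nlink = count;       /* older kernels: assign the field directly */
#endif
}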

Reaching the breaking point

When I first implemented genconfig.sh in 2009 we only supported a few versions of the Linux kernel. Since then our support matrix has swollen to include every variant from 2.6.9 through 3.5.0, including quirky “enterprise” distributions that habitually backport advanced features without changing the reported kernel version. But platform support tends to be a mostly one-way street: once something is in the matrix, it’s very hard to pull it out. As a consequence, the number of feature tests in genconfig.sh has grown too, from about a dozen in the original implementation to over 50 in the latest version. Here’s a real-time recording of a recent version of genconfig.sh on one of the virtual machines in our test lab:

genconfig.sh executing on a test system

Accelerator actually has two instances of genconfig.sh, one for each of the two kernel drivers we bundle, which means every time we install Accelerator we burn about 2 minutes waiting for genconfig.sh — 25% of the 8 minutes total it takes to run the install. All told I think a conservative estimate is that this costs my team nearly one full person-month of time per year, between time waiting for CI builds (which do automated installs), time waiting for manual installs (for testing and verification) and my own time spent waiting when I’m making updates to support new kernel versions.

genconfig.sh: The Next Generation

I had a hunch about the source of the performance problem: the Linux kernel build system. Remember, the pattern repeated for each feature test in the original genconfig.sh is as follows:

  1. Emit a simple C source file, called test.c.
  2. Invoke the kernel build system, using make and a trivial kernel module makefile:
    conftest-objs := test.o
    obj-m := conftest.o
    EXTRA_CFLAGS += -Werror
  3. Check the exit status from make to decide whether the test succeeded or failed. (A shell sketch of this per-probe pattern follows below.)
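
To make that pattern concrete, here is a minimal sketch of a single probe written as standalone shell. The kernel directory, the HAVE_SET_NLINK name and the config.h output are illustrative stand-ins, not the actual genconfig.sh code:

#!/bin/sh
# Minimal sketch of the original one-probe-at-a-time pattern (illustrative only).
KDIR=/lib/modules/$(uname -r)/build

# 1. Emit a simple C source file, test.c (the set_nlink probe shown earlier).
cat > test.c <<'EOF'
#include <linux/fs.h>
void dummy(struct inode *i)
{
    set_nlink(i, 0);
}
EOF

# 2. Emit the trivial kernel module makefile and invoke the kernel build system.
cat > Makefile <<'EOF'
conftest-objs := test.o
obj-m := conftest.o
EXTRA_CFLAGS += -Werror
EOF

# 3. The exit status from make tells us whether the feature is present.
if make -C "$KDIR" M="$(pwd)" modules >/dev/null 2>&1 ; then
    echo "#define HAVE_SET_NLINK 1" >> config.h
else
    echo "/* set_nlink() not available */" >> config.h
fi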

The C sources used to probe for features are trivial, so it seemed unlikely that the compilation itself would take so long. But we don’t have to speculate — if we use Electric Make instead of GNU make to run the test, we can use the annotated build log and ElectricInsight to see exactly what’s going on:

ElectricInsight visualization of Linux kernel module build.

Overall, using the kernel build system to compile this single file takes nearly 2 seconds — not a big deal with only a few tests, but it adds up quickly. To be clear, the only part we actually care about is the box labeled /root/__conftest__/test.o, which is about 1/4 second. The remaining 1 1/2 seconds? Pure overhead. Perhaps most surprising is the amount of time burned just parsing the kernel makefiles — the huge bright cyan box on the left edge, as well as the smaller bright cyan boxes in the middle. Nearly 50% of the total time is just spent parsing!
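
If you want to reproduce that analysis, the probe compile can be run under emake with annotation enabled and the resulting file opened in ElectricInsight. Treat this as a sketch: the annotation flags are emake’s standard options, but the paths and detail levels shown are illustrative, and you would add --emake-cm=<host> to run against a cluster.

# Rerun a single probe build under emake, capturing an annotation file.
emake --emake-annofile=conftest.anno \
      --emake-annodetail=basic,file,lookup,waiting \
      -C /lib/modules/$(uname -r)/build M=$(pwd) modules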

At this point an idea struck me: there’s no particular reason genconfig.sh must invoke the kernel build system separately for each probe. Why not write out all the probe files upfront and invoke the kernel build system just once to compile them all in a single pass? In fact, with this strategy you can even use parallel make (eg, make -j 4) to eke out a little bit more speed.

Of course, you can’t use the exit status from make with this approach, since there’s only one invocation for many tests. Instead, genconfig.sh can give each test file a descriptive name, and then check for the existence of the corresponding .o file after make finishes. If the file is present, the feature is present; otherwise the feature is absent. Here’s the revised algorithm:

  1. Emit a distinct C source file for each kernel feature test. For example, the sample shown above might be created as set_nlink.c. Another might be write_begin.c.
  2. Invoke the kernel build system, using make -j 4 and a slightly less trivial kernel module makefile:
    conftest-objs := set_nlink.o write_begin.o ...
    obj-m := conftest.o
    EXTRA_CFLAGS += -Werror
  3. Check for the existence of each .o file, using something like if [ -f set_nlink.o ] ; then … ; fi to decide whether the test succeeded or failed. (A sketch of this batched flow follows below.)
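
Here is a minimal sketch of the batched flow, again with illustrative names (the real script has over 50 probes and does not necessarily look like this):

#!/bin/sh
# Sketch of the batched approach: all probes compiled in one kernel build.
KDIR=/lib/modules/$(uname -r)/build
PROBES="set_nlink write_begin"          # one trivial .c file per probe, emitted up front

rm -f *.o config.h                      # don't let stale objects fake a "pass"

# One makefile listing every probe object; -k keeps going past probes that
# fail to compile, and -j 4 compiles the rest in parallel.
cat > Makefile <<EOF
conftest-objs := $(for p in $PROBES; do printf '%s.o ' "$p"; done)
obj-m := conftest.o
EXTRA_CFLAGS += -Werror
EOF
make -C "$KDIR" M="$(pwd)" -k -j 4 modules >/dev/null 2>&1

# A probe succeeded if and only if its object file exists.
for p in $PROBES ; do
    name=$(echo "$p" | tr '[:lower:]' '[:upper:]')
    if [ -f "$p.o" ] ; then
        echo "#define HAVE_$name 1" >> config.h
    else
        echo "/* $p not available */" >> config.h
    fi
done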

The net result? After an afternoon of refactoring, genconfig.sh now completes in about 7 seconds, nearly 10x faster than the original:

Updated genconfig.sh executing on a test system.

The only drawback I can see is that the script no longer has that satisfying step-by-step output, instead dumping everything out at once after a brief pause. I’m perfectly happy to trade that for the improved performance!

Future Work and Availability

This new strategy has significantly improved the usability of genconfig.sh. After I finished the conversion, I wondered if the same technique could be applied to autoconf configure scripts. Unfortunately I find autoconf nearly impossible to work with, so I didn’t make much progress exploring that idea. Perhaps one of my more daring (or stubborn) readers will take the ball and run with it there. If you do, please comment below to let us know the results!

The new version of genconfig.sh is used in ElectricAccelerator 6.2, and can also be seen in the open source Loopback File System (lofs) on my Github repo.

Fixing recursive make

Recursive make is one of those things that everybody loves to hate. It’s even been the subject of one of those tired “… Considered Harmful” diatribes. According to popular opinion, recursive make will sap performance from your build, make it nigh impossible to ensure correctness in parallel builds, and may render the user sterile. OK, maybe not that last one. But seriously, the arguments against recursive make are legion, and deeply entrenched. The problem? They’re flawed. That’s because they assume there’s only one way to implement recursive make — when the submake is invoked, the parent make is blocked until the submake completes. That’s how almost everybody does it. But in Electric Make, part of ElectricAccelerator, we developed a novel approach called non-blocking recursive make. This design eliminates the biggest problems attributed to recursive make, without requiring a painful and costly conversion of your build system to non-recursive make.

The problem with traditional recursive make

There are really just two problems at the heart of complaints with traditional recursive make: first, there’s no way to ensure correctness of a parallel recursive make based build without overserializing the submakes, because there’s no way to articulate dependencies between individual targets in different submakes. That means you can’t have a dependency graph that is both correct and precise. Instead you either leave out the critical dependency entirely, which makes parallel (ie, fast) builds unreliable; or you serialize submakes in their entirety, which shackles build performance because no part of a submake with even a single dependency on some portion of an earlier submake can begin until the entire earlier submake completes. Second, even if there were a way to specify precise dependencies between targets in different submakes, most versions of make have implemented recursive make such that the parent make is blocked from proceeding until the submake has completed. Consider a typical use of recursive make with implicit serializations between submakes:

all:
	@for dir in util client server ; do \
		$(MAKE) -C $$dir; \
	done

Each submake compiles a bunch of source files, then links them together into a library (util) or an executable (client and server). The only actual dependency between the work in the three make instances is that the client and server programs need the util library. Everything else is parallelizable, but with traditional recursive make, gmake is unable to exploit that parallelism: all of the work in the util submake must finish before any part of the client submake begins!

Conflict detection and non-blocking recursive make

If you’re familiar with Electric Make, you already know how it solves the first half of the recursive make problem: conflict detection and correction. I’ve written about conflict detection before, but here’s a quick recap: using the explicit dependencies given in the makefiles and information about the files accessed as each target is built, emake is able to dynamically determine when targets have been built too early due to missing explicit dependencies, and rerun those targets to generate the correct output. Electric Make can ensure the correctness of parallel builds even in the face of incomplete dependencies, even if the missing dependencies are between targets in different submakes. That means you need not serialize entire submakes to ensure the build will run correctly in parallel.

Like an acrobat’s safety net, conflict detection allows us to consider solutions to the other half of the problem that would otherwise be considered risky, if not outright madness. In fact, our solution would not be possible without conflict detection: non-blocking recursive make. This is analogous to the difference between blocking and non-blocking I/O: rather than waiting for a recursive make to finish, emake carries on executing subsequent commands in the build immediately, including other recursive makes. Conflict detection ensures that only the commands in each submake which require serialization are executed sequentially, so the build runs as quickly as possible, but the final build output is identical to a serial build.

The impact of this change is dramatic. Here I’ve plotted the execution of the simple build defined above on four cores, using both gmake (normal recursive make) and emake (non-blocking recursive make):

Recursive make build with gmake


Recursive make build with emake

Electric Make is able to execute this build about 20% faster than gmake, with no changes to the Makefiles or the execution environment. emake is literally able to squeeze more parallelism out of recursive-make-based builds than gmake. In fact, we can precisely quantify just how much more parallelism emake gets through an application of Amdahl’s law. First, we compute the best possible speedup for the build — that’s just the serial runtime divided by the best possible parallel runtime, which we can figure out through analysis of the dependency graph and runtime of individual jobs in the build (the Longest Serial Chain report in ElectricInsight can do this for you). Then we can compute the parallelizable portion P of the build by plugging the speedup S into this equation: P = 1 – (1 / S). Here’s how that works out for gmake and emake:

                    gmake    emake
Serial baseline     65s      65s
Best build time     13.5s    7.5s
Best speedup        4.8x     8.7x
Parallel portion    79%      89%
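
Working the last two rows out explicitly with the formula above:

S(gmake) = 65 / 13.5 ≈ 4.8, so P(gmake) = 1 – (1 / 4.8) ≈ 0.79
S(emake) = 65 / 7.5 ≈ 8.7, so P(emake) = 1 – (1 / 8.7) ≈ 0.89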

On this build, non-blocking recursive make increases the parallel portion of the build by 10 percentage points, from 79% to 89%. That may not seem like much, but Amdahl’s law shows how dramatically that difference affects the speedup you can expect as you apply more cores:
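
Amdahl’s law predicts a speedup of 1 / ((1 – P) + P / N) on N cores. Plugging in the two parallel portions from the table (back-of-the-envelope arithmetic, rounded):

P = 0.79 (gmake): speedup(16) = 1 / (0.21 + 0.79 / 16) ≈ 3.9x, speedup(32) ≈ 4.3x
P = 0.89 (emake): speedup(16) = 1 / (0.11 + 0.89 / 16) ≈ 6.0x, speedup(32) ≈ 7.3x

At 32 cores, that 10-point difference translates into a ceiling roughly 70% higher for the non-blocking build.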

Implementation

On the backend, non-blocking recursive make is handled by conflict detection — the jobs from the recursive make are checked for conflicts in the serial order defined by the makefile structure. Any issues caused by aggressively running recursive makes early are detected during the conflict check, and the target that ran too early is rerun to generate the correct result.

On the frontend, emake uses a strategy that is at once both brilliant in its simplicity, and diabolical in its trickery. It starts with an environment variable. When emake is invoked recursively, it checks the value of EMAKE_BUILD_MODE. If it is set to node, emake runs in so-called stub mode: rather than executing the submake (parsing the makefile and building targets), emake captures the invocation context (working directory, command-line and environment) in a file on disk, prints a “magic” string and exits with a zero status code.

The file containing the invocation context is identified by a second environment variable, ECLOUD_RECURSIVE_COMMAND_FILE. The Accelerator agent (which handles invoking commands on behalf of emake) checks for the presence of that file after every command that is run. If it is found, the agent relays the content to the toplevel emake invocation, where a new make instance is created to represent the submake invocation. That instance comes with its own parse job of course, which gets inserted into the queue of jobs. Some (short) time later, the parse job will run, discover whatever work must be run by the submake, and create additional rule jobs.
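
In pseudo-shell terms, the stub behaves roughly like this. This is only a conceptual sketch of the behavior described above, not actual emake code, and the layout of the context file is invented for illustration:

# Conceptual sketch: record the invocation context where the agent can find
# it, print the magic placeholder, and get out of the way.
{
    pwd                   # working directory
    printf '%s\n' "$@"    # the command line the submake was invoked with
    env                   # the environment
} > "$ECLOUD_RECURSIVE_COMMAND_FILE"
echo "EMAKE_FNORD"        # placeholder for the submake's output in the log
exit 0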

The magic string — EMAKE_FNORD — serves as a placeholder in the stdout stream for the jobs, so emake can figure out which portion of the output text comes before and which portion comes after the submake. This ensures that the build output log is identical to that generated by a serialized gmake build. For example, given the following rule that invokes a submake, you’d expect to see the “Before” and “After” messages printed before and after the output generated by commands in the submake itself:

all:
	@echo Before util ; \
	$(MAKE) -C util ; \
	echo After util

With non-blocking recursive make, the submake has not actually executed when the “echo After util” command runs. If emake doesn’t account for that reordering, both the “Before” and “After” messages will appear before any of the output from the submake. EMAKE_FNORD allows emake to “stitch” the output together so the build log matches a serial log.

Limitations

Conflict detection and non-blocking recursive make together solve the main problems associated with recursive make. But there are a couple scenarios where non-blocking recursive make does not work well. Fortunately, these are uncommon in practice and easily addressed.

Capturing recursive make stdout

The first scenario is when the build captures the output of the recursive make invocation, rather than letting it print to stdout as normal. Since emake defers the execution of the submake and prints only EMAKE_FNORD to stdout, this will not work. There are two reasons you might do this: first, you might want to have separate build logs for each submake, to simplify error detection and management. In this situation, the simplest workaround is to remove the redirection and instead use emake’s annotated build log, an XML version of the build output log which can be easily processed using standard tools. Second, you may be using make as a text-processing tool (sort of a “poor man’s” Perl), rather than for building per se:

all:
	@$(MAKE) -f genlist.mk > objects.txt
	@cat objects.txt | xargs rm

In this case, the workaround is to explicitly force emake to run in so-called “local” mode, which means emake will handle the recursive make invocation as a blocking invocation, just like traditional make would. You can force emake into local mode by adding EMAKE_BUILD_MODE=local to the environment before the recursive make invocation.

Immediate consumption of build products

The second scenario is when the build consumes the product of the submake in the same command that contains the invocation. For example:

all:
	@$(MAKE) -C sub foo && cp sub/foo ./foo

Here the build assumes that the output files generated by the submake will be available for use immediately after the submake completes. Obviously this is not the case with non-blocking recursive make — when the invocation of $(MAKE) -C sub foo completes, only the submake stub has actually finished. The build products will not be available until after the submake is actually processed later. Note that in this build both the recursive make invocation and the commands that use the build products from that invocation are treated as a single command from the perspective of make: make actually invokes the shell, and the shell then runs the recursive make and cp commands.

The workaround is simple: split the consumer into a distinct command, from the perspective of make:

all:
	@$(MAKE) -C sub foo
	@cp sub/foo ./foo

With that trivial change, emake is able to treat the cp as a continuation job, which can be serialized against the completion of the recursive make as needed.

A fix for recursive make

For years, people have heaped scorn and criticism on recursive make. They’ve nearly convinced everybody that even considering its use is automatically wrong — you probably can’t help feeling a little bit guilty when you use recursive make. But the reality is that recursive make is a reasonable way to structure a large build. You just need a better make. With conflict detection and non-blocking recursive make, Electric Make has fixed the problems usually associated with recursive make, so you can get parallel builds that are both fast and correct. Give it a try!

Another confusing conflict in ElectricAccelerator

After solving the case of the confounding conflict, my user came back with another scenario where ElectricAccelerator produced an unexpected (to him) conflict:

all:
	@$(MAKE) foo
	@cp foo bar

foo:
	@sleep 2 && echo hello world > foo

If you run this build without a history file, using at least two agents, you will see a conflict on the continuation job that executes the cp foo bar command, because that job is allowed to run before the job that creates foo in the recursive make invocation. After one run of course, emake records the dependency in history, so later builds don’t make the same mistake.
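
If you want to see the difference for yourself, two back-to-back runs against the same history file will show it. The cluster manager host and history file name here are placeholders, and as noted above the build needs at least two agents:

emake --emake-cm=cm.example.com --emake-historyfile=emake.data all    # run 1: conflict on "cp foo bar", corrected by a rerun
emake --emake-cm=cm.example.com --emake-historyfile=emake.data all    # run 2: history serializes the continuation, no conflict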

This situation is a bit different from the symlink conflict I showed you previously. In that case, it was not obvious what caused the usage that triggered the conflict (the GNU make stat cache). In this case, it’s readily apparent: the continuation job reads (or attempts to read) foo before foo has been created. That’s pretty much a text-book example of the sort of thing that causes conflicts.

What’s surprising in this example is that the continuation job is not automatically serialized with the recursive make that precedes it. In a very real sense, a continuation job is an artificial construct that we created for bookkeeping reasons internal to the implementation of emake. Logically we know that the commands in the continuation job should follow the commands in the recursive make. In fact it would be absolutely trivial for emake to just go ahead and stick in a dependency to ensure that the continuation is not allowed to start until after the recursive make finishes, thereby avoiding this conflict even when you have no history file.

Absolutely trivial to do, yes — but also absolutely wrong. Not for correctness reasons, this time, but for performance. Remember, emake is all about maximizing performance across a broad range of builds. Given a choice between two strategies that both produce correct output, emake uses the strategy that produces the best performance in the general case. For continuation jobs, that means not automatically serializing the continuation against the preceding recursive make. I could give you a wordy, theoretical explanation, but it’s easier to just show you. Suppose that your makefile looked like this instead of the original — the difference here is that the continuation job itself launches another recursive make, rather than just doing a simple cp:

all:
	@$(MAKE) foo
	@$(MAKE) bar

foo:
	@sleep 2 && echo hello world > foo

bar:
	@sleep 2 && echo goodbye > bar

Hopefully you agree that the ideal execution of this build would have both foo and bar running in parallel. Forcing the continuation job to be serialized with the preceding recursive make would choke the performance of this build. And just in case you’re thinking that emake could be really clever by looking at the commands to be executed in the continuation job, and only serializing “when it needs to”: it can’t. First, that would require emake to implement an entire shell syntax parser (or several, really, since you can override SHELL in your makefile). Second, even if emake had that ability, it would be thwarted the instant the command is something like my_custom_script.pl — there’s no way to tell what will happen when that gets invoked. It could be a simple filesystem access. It could be a recursive make. It could be a whole series of recursive makes. Even when the command is something you think you recognize, can emake really be sure? Maybe cp is not our trustworthy standard Unix cp, but something else entirely.

Again, all is not lost for this user. If you want to avoid this conflict, you have a couple options:

  1. Use a good history file from a previous build. This is the simplest solution. You’ll only get conflicts in this build if you run without a history file.
  2. Refactor the makefile. You can explicitly describe the dependency between the commands in the continuation job and the recursive make by refactoring the makefile so that the stuff in the continuation is instead its own target, thus taking the decision out of emake’s hands. Here’s one way to do that:
    all: do_foo
    	@cp foo bar

    do_foo:
    	@$(MAKE) foo

    foo:
    	@sleep 2 && echo hello world > foo

Either of these will eliminate the conflict from your build.