Programming | eric melski's blog.melski.net

What factors affect build performance?

Recently a customer asked me to help them create a list of factors that affect build performance. They found themselves often tasked with explaining to their developers why one build had worse performance than another, or with finding ways to further improve the performance of a build. This is a very big, very complex question — I think perhaps much more so than they realized at first! In fact I think the question as posed is fundamentally unanswerable: I could never give an exhaustive list of the factors that affect build performance. There are too many, and there are surely some that I myself have yet to see — “unknown unknowns” as they say.

Nevertheless, there is value in making a list, even if incomplete, if for no other reason than to serve as a reference for people trying to understand or improve the performance of their builds. What follows is my attempt at creating that list, roughly in order of importance. Bear in mind that this ordering is somewhat subjective and highly situation-dependent: your mileage may vary, and different builds will have different specific bottlenecks.

Factors that affect build performance

Build size
Builds can be measured in many ways: number of output targets, number of input files, total lines of code, aggregate bytes of output generated, etc. Generally speaking, the bigger the build is, the longer it will take to complete. If your build is long simply because of its size, you may think you have no opportunities, but that’s not so: parallel builds, caching outputs, componentization and beefier hardware can all help cope with this type of problem.
Execution parallelism
Most build tools support some form of parallel execution — GNU make’s -j option is the classic example. Assuming there is parallelizable work in the build, the build that runs on more cores (that is, with a higher -j value) will complete more quickly.
Available parallelism / build structure
Running a build on many cores only helps if there is exploitable parallelism in the build. If the build is defined in such a way that parallelism is limited, then it will take longer to complete regardless of the -j value. For example, some builds may have unnecessary serializations in the dependency graph, which will limit performance.
Caching
The use of (or failure to use) caching technology, such as ElectricAccelerator’s JobCache, ClearCase winkins, or ccache, can dramatically impact build performance. In my tests caching such as JobCache can reduce build duration by 50% or more for full builds.
Conflicts (ElectricAccelerator only)
For builds executed with ElectricAccelerator, conflicts can have a significant impact on performance. Briefly, an Accelerator conflict occurs any time your build “loses” a race between two steps that should have been serialized but weren’t, due to missing dependencies in your makefile or build files. Accelerator can detect and correct such errors on-the-fly, but it comes at a cost. A few conflicts are usually not a problem, but if you have hundreds or thousands, it will make your build slower as Accelerator reruns portions of the build to get the correct results. Usually such a scenario is a sign that you didn’t have a complete Accelerator history file for your build, so fixing the issue is as easy as using the history file generated by that build to augment the dependency information in future runs of the build.
Code complexity and structure
There are many attributes of the code itself which can affect build performance. For example, as a general rule, very long source files will take longer to compile than shorter files. Files containing very long individual functions will take longer to compile than files with only short functions. Heavy use of templates in C++ will cause slower compilation. Careless use of #include statements in C or C++ code will cause slower compilation and can be especially harmful to incremental builds by triggering excessive recompilation.
Implementation language
Some languages are easier to process and therefore faster to compile. In general, C++ is slower to compile, while languages like Java or Go are very quick to compile. Some languages require no compilation at all, so builds of code using such languages can be very very fast indeed!
Build tool
There is a staggering array of build tools which you might choose to drive your build: GNU make, ninja, maven, ant, scons, emake, tup and more. Some were designed for high performance on full builds, while others were designed for high performance on incremental builds, and still others were designed for ease-of-use, correctness, or other non-performance related attributes. The choice of build tool will affect the performance of your build, especially if your build is very large.
Compiler
For compiled languages like C and C++ there are often many different compilers that you could choose from for your build: gcc/g++, clang, icc, WindRiver, Microsoft cl, tcc, etc. These tools themselves have different performance profiles, and the performance may even vary from one version of the compiler to the next.
Compiler options
For a given compiler, the build options you enable may significantly affect the compile time. For example, when using gcc, building with -O3 is generally slower than building with -O0. Therefore for developer builds, you may consider disabling optimizations in order to reduce build cycle time. Other options that may influence compile speed include: pre-compiled headers (PCH); dependency generation (-MD, -MMD, -MF, etc); profiling or coverage analysis (-fprofile-arcs or -ftest-coverage); and include path definitions (-I), which if very long can cause the compiler to spend excessive time searching for header files.
Linker
As with the compiler, different linkers have different performance characteristics. For C/C++ compilation on Linux the default linker is GNU ld, but there are alternatives like Google gold which have much better performance, albeit for a subset of the use cases supported by GNU ld. If your use case is supported by gold, you will likely see much better build performance by switching.
Memory
As with any process involving computers, the amount of available memory will have a significant impact on build performance. Too little and your system will swap excessively. Fortunately there’s no such thing as “too much”, though it may be prohibitively expensive to get so much RAM that you can stop worrying about it. In practice most builds do not require a huge amount of memory, but if yours do, and you don’t have enough, your build speed will suffer.
Disk performance
Like memory, the performance of your disk can significantly affect build performance. In fact it’s easier in some ways to understand the impact of disk speed. If the build generates 10GB of output and your disk can only write at 10MB/s, the fastest that the build can possibly finish is about 1,000 seconds, or roughly 17 minutes. On the other hand, if the build generates only 5MB of output and uses the same disk, then only 1/2 second is needed to write the build outputs, so the disk is unlikely to be a bottleneck. You may find that the disk is adequate for your builds now, but as the build gets bigger you will reach a point where the disk is no longer fast enough. At that point you can upgrade to a faster disk, and that will be sufficient for some time until again your build grows to exceed the capacity of the disk.

Even if your disk is not a primary bottleneck now, switching to a faster disk may improve performance somewhat. Many users have had good results from switching to SSD for temporary storage, or using striped RAID for those builds that generate truly enormous amounts of data.
Network performance
For distributed builds such as those executed with ElectricAccelerator, network performance is crucial because build data has to be transferred across the network. But even if the build itself is not distributed, it may make use of tools pulled from a network file share, so the network performance can affect the build.
Operating system / kernel version
Some operating systems have better performance for builds than others — in general, I’ve found builds on Linux to be relatively faster than builds on Windows, for example. Likewise, some versions of the operating system may be faster than others. Some users have reported as much as a 3x improvement by upgrading from an old version of Linux to a newer version due to optimizations in the kernel itself.
Anti-virus software
Use of anti-virus software can dramatically impact build performance, particularly if the A/V is configured in one of the more aggressive or intrusive modes of operation: sometimes every file operation is intercepted by the A/V scanner, adding a substantial drag on build speed.
License management
Some build tools, such as certain commercial compilers, require licenses in order to operate. If the license system is misconfigured it can add delays to the build process, sometimes causing each compile to take minutes instead of seconds as the compiler tries and fails to contact a license server, or contacts the wrong license server instance (for example, one on a different subnet).
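
Two of the factors above, available parallelism and disk throughput, lend themselves to back-of-the-envelope arithmetic. Here's a rough Python sketch of both bounds; it is not a model of any real build tool. The job durations and the compile-then-link shape are invented for illustration, and only the 10GB, 5MB and 10MB/s figures come from the disk example in the list.

```python
def min_build_seconds(job_seconds, critical_path_seconds, cores):
    """Lower bound on wall-clock time for a parallel build.

    No schedule can beat either limit: all the work spread evenly across
    the cores, or the longest chain of jobs that must run one after another.
    """
    total_work = sum(job_seconds)
    return max(total_work / cores, critical_path_seconds)


def min_write_seconds(output_bytes, disk_bytes_per_second):
    """Shortest possible time to write the build outputs to disk."""
    return output_bytes / disk_bytes_per_second


MB = 10**6
GB = 10**9

# Hypothetical build: 100 compiles of 6 s each, then one 30 s link.
jobs = [6.0] * 100 + [30.0]
# Longest serial chain in that build: one compile followed by the link.
chain = 6.0 + 30.0

for cores in (1, 4, 16, 64):
    print(cores, "cores:", min_build_seconds(jobs, chain, cores), "s at best")

print("10 GB at 10 MB/s:", min_write_seconds(10 * GB, 10 * MB), "s")
print("5 MB at 10 MB/s:", min_write_seconds(5 * MB, 10 * MB), "s")
```

With these made-up numbers, going from 16 to 64 cores barely helps: the bound only drops from about 39 seconds to 36, because the compile-then-link chain dominates. That is the "available parallelism" limit in miniature.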

A foundation for performance investigations

So there you have it: my (not entirely) comprehensive list of factors that can affect build performance. Of course these won’t all be relevant for every build: every build is different, and each has a unique performance profile. A slow disk may be mostly irrelevant for one build but absolutely critical for another. My hope is that this list will serve as a foundation for your build performance investigations — something to help get you started, even if it doesn’t get you all the way to a conclusion.

What do you think of my list? What would you add, and how would you change the ordering? Let me know in the comments below.

What’s new in ElectricAccelerator 9.0?

Just a couple months ago, in October 2016, we released ElectricAccelerator 9.0. This version includes some really exciting new functionality and unlocks even more amazing performance than ever before. For the first time since 2008 we added support for a new build tool: ninja, an ultra-fast new make-like build tool and the workhorse at the center of the build for both chromium and Android (yes, that Android). And we’ve continued to expand the JobCache feature — a generalization of the parse avoidance feature introduced in Accelerator 7.0. With Accelerator 9.0 you can cache more types of work, including GCC/G++ compiles, clang compiles, Microsoft cl compiles, javac and javadoc, and Google’s new Jack compiler for Java code. Even better, you can share cached results with other developers to amplify the gains across an entire team. Read on for details.

Ninja emulation

Accelerator 9.0 introduces support for ninja-based builds. Ninja is a very interesting build tool: conceptually similar to make, but radically simplified (at least so far!). Gone are things like built-in functions, pattern rules, vpath, conditional directives, and all the other things that make it hard to parse and evaluate makefiles quickly. This enables the ninja parser to evaluate “ninja files” unbelievably quickly, but at the cost of making ninja files verbose and ill-suited for creation by hand. Instead, ninja files are typically generated from some other process, such as CMake. The benefit to the end user then is extremely fast incremental builds: for example, in Android 6.0, using the original make-based build system, a no-touch build could take as much as a minute to run even though there’s no work to be done. In Android 7.0, using the new ninja-based build system, the same build can be completed in about 5 seconds!

ElectricAccelerator’s emulation of ninja is, I think, remarkably anticlimactic: to execute a ninja build, simply invoke emake --emake-emulation=ninja. That’s it. Here’s a very simple “Hello, world!” ninja file:

rule echo
  description = Building $out
  command = echo "Hello, world!"

build foo: echo

And the result of running this with emake --emake-emulation=ninja:

$ emake --emake-emulation=ninja
Starting build: local-32601
Building foo
Hello, world!
Finished build: local-32601   Duration: 0:00 (m:s)
$

As I said, it’s utterly uninteresting, which, paradoxically, makes it very interesting: the integration is seamless and it “just works”. Even better, by running your ninja build with ElectricAccelerator you automatically and instantly take advantage of all the advanced acceleration and correctness features you’ve come to love about Accelerator: conflict detection, history, schedule optimization, annotation, even jobcache. It all just works.

JobCache Enhancements

In Accelerator 7.0 we introduced parse avoidance, a mechanism for caching the result of makefile parsing in one build in order to accelerate subsequent builds. Once we had shown that this type of caching could dramatically improve build performance, we refactored the code behind parse avoidance to create a general purpose caching framework dubbed JobCache, and in subsequent releases we’ve steadily expanded the types of work to which jobcache can be applied:

Accelerator 7.1: jobcache for Javadoc generation
Accelerator 8.0: jobcache for C/C++ compiles using clang/gcc/g++ (comparable to, but better than, ccache)
Accelerator 8.1: jobcache for C/C++ compiles using Microsoft cl

In Accelerator 9.0 we’ve expanded the reach of jobcache in two ways. First, we added support for caching javac and Jack compiles. Next, we added shared jobcache, which enables a team of developers to leverage jobcache collectively and reliably, eliminating redundant work across the entire team.

With shared jobcache, the team designates a “blessed” or “golden” build process to populate the cache, typically the nightly or continuous integration builds. This build simply uses jobcache as normal, using --emake-assetdir to specify a location on a shared filesystem to host the cache. Then, each developer explicitly requests to use the shared cache by adding --emake-shared-assetdir to the command-line when they invoke emake, specifying the same location. Once enabled, emake uses both the shared cache and the private cache during the build. For each job that uses jobcache:

1. Check the shared jobcache for a matching entry.
   a. If a match is found in the shared jobcache, use it. Done!
   b. If a match is not found in the shared jobcache, continue.
2. Check the private jobcache for a matching entry.
   a. If a match is found in the private jobcache, use it. Done!
   b. If a match is not found in the private jobcache, continue.
3. Run the job as normal.
4. Save the result to the private cache.

Note that the shared cache is never written to by the developers’ builds: updates are only saved in the private cache. In this way we can ensure that developers’ builds do not litter the shared cache with one-off or user-specific cache entries. Typically we expect that developers will see very good cache hit rates against the shared cache, perhaps 95% or better, since each developer modifies only a small fraction of the total source code at once. Thus shared jobcache multiplies the savings from jobcache by the size of the team.
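
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the ordering, not how emake implements it: the dictionaries standing in for the two caches, the string cache keys, and the run_with_jobcache name are all hypothetical.

```python
def run_with_jobcache(key, shared_cache, private_cache, run_job):
    """Return (result, source) for one job, following the lookup order above."""
    if key in shared_cache:          # 1. shared cache hit: use it
        return shared_cache[key], "shared"
    if key in private_cache:         # 2. private cache hit: use it
        return private_cache[key], "private"
    result = run_job()               # 3. miss in both: run the job as normal
    private_cache[key] = result      # 4. save to the private cache only
    return result, "ran"


# Hypothetical caches: the shared one was populated by the nightly build.
shared = {"compile foo.c": "foo.o (from nightly)"}
private = {}

print(run_with_jobcache("compile foo.c", shared, private, lambda: "foo.o"))
print(run_with_jobcache("compile bar.c", shared, private, lambda: "bar.o"))
print(run_with_jobcache("compile bar.c", shared, private, lambda: "bar.o"))
print(sorted(shared))  # the developer's build never adds to the shared cache
```

The second call runs the job and the third is served from the private cache, while the shared cache keeps exactly the entries the "golden" build put there.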

Dynamic file patching

The final feature of interest in Accelerator 9.0 is dynamic file patching. This is a mechanism by which emake can patch files on the fly as they are referenced during the build, based on the name, size and MD5 checksum of the original. This feature enables users to tweak build scripts or makefiles in order to improve performance or compatibility with Accelerator — critical in environments where there is limited ability to modify the original files directly.
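
To give a feel for what matching on name, size and MD5 checksum might look like, here is a small Python sketch. It is purely illustrative: the patch table, the choice of base name rather than full path, and the function names are my assumptions for this example, not a description of how emake stores or applies patches.

```python
import hashlib
import os


def fingerprint(name, data):
    """Identify a file by base name, size in bytes, and MD5 of its content."""
    return (os.path.basename(name), len(data), hashlib.md5(data).hexdigest())


def read_with_patches(name, data, patches):
    """Return the replacement content if the original matches a known fingerprint."""
    return patches.get(fingerprint(name, data), data)


# Hypothetical patch: swap a slow recipe for a fast one in one specific Makefile.
original = b"all:\n\tsleep 60\n"
patched = b"all:\n\ttrue\n"
patches = {fingerprint("Makefile", original): patched}

print(read_with_patches("src/Makefile", original, patches) == patched)
print(read_with_patches("src/Makefile", original + b"# edited\n", patches) == patched)
```

Keying on size and checksum as well as name means a patch only applies to the exact file it was written for: once the original is edited, the fingerprint no longer matches and the file is read unmodified.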

Looking forward to 9.1

Accelerator 9.0 contains some really tremendous new features: the first new build tool emulation in almost a decade; shared jobcache; on-the-fly patching for those challenging environments where no other option will do. But as always, my eye is already on the next horizon: Accelerator 9.1. We have some big plans relating to performance and ease-of-use. It will require a lot of hard work but I think we have the right team to do it. Stay tuned.

Accelerator 9.0 is available immediately for existing customers: contact support@electric-cloud.com to get the bits. New users can download ElectricAccelerator Huddle to take it for a test drive, or contact sales@electric-cloud.com for an evaluation of the enterprise edition.

What’s new in GNU make 4.2?

In May 2016 the GNU make team released GNU make 4.2. I’m pleased to see another release, though I find myself underwhelmed by both the timeline and the content of this release. When 4.1 came out just one year after 4.0 I hoped it was a sign that the GNU make project was switching to a more frequent and regular release cycle, as many software projects have done in the last several years. Although it can be a difficult adjustment this release cadence can have significant benefits like improving user engagement and reducing risk. But with the 4.2 release arriving nineteen long months after 4.1 it seems that GNU make has failed to make the transition.

Of course infrequent releases are not necessarily a problem, as long as the releases contain compelling new functionality. Unfortunately the new features in GNU make 4.2 are charitably described as “uninspiring”, though I’m sure each enhancement will be handy for the corner case it was designed to address. Of course GNU make is a mature project by any definition, and frankly it does what it does pretty well and has for a very long time; maybe it’s just “done”. But consider this: the past few years have seen something of an explosion in the build tool space, with several new build tools cropping up. Each of the following alternative build tools has had multiple releases in the last year, and each has innovative features that could be adopted by GNU make:

gradle, the default build tool for Android apps. Monthly releases. Reports and notifications.
bazel, the open-source version of Google’s internal build system. Ten releases already in 2016. Checksum-based up-to-date checks and minimization of test suite executions.
ninja, a make-like build tool. Two releases in the last twelve months. Resource pools and unbelievably fast parsing / low overhead.

So, what does GNU make 4.2 have to offer? Read on to see, and let me know in the comments if you disagree with my analysis.

.SHELLSTATUS variable

GNU make has had the $(shell) function for many years. This provides a mechanism by which you can get the result (stdout) of an arbitrary command into a make variable, where you can do whatever you like with it. One curious thing about $(shell) is that it doesn’t care at all whether the command you execute succeeds or not, so if you try to read a non-existent file, for example, with $(shell cat missing.txt), GNU make will simply return an empty string as the result of the shell invocation. In some cases you may want to actually check the exit status of that command, and in GNU make 4.2 you now can, by checking the value of .SHELLSTATUS, a new built-in variable that is automatically set to the exit code of the most recent $(shell) (or != assignment). Here’s a contrived example:

FOO := $(shell exit 1)
ifneq ($(.SHELLSTATUS),0)
   $(error shell command failed!)
endif

all: ; @echo done

As you can see, it’s now possible to make your makefile react in whatever manner you deem appropriate when a shell invocation fails. Be advised, however: if you find yourself doing this, it may be an indication that your makefile is poorly written — almost every use of $(shell) is better handled by creating an actual rule to do whatever you were going to do with $(shell).

Read files with $(file)

The $(file) function was added to GNU make in the 4.0 release, in order to enable the creation of files directly from make — quite handy for those cases in which the content you want to write is so long it exceeds command-line length limits on your system. In 4.2 the $(file) function was extended so that you can use it to read files in addition to writing files. For example, SRCS := $(file <sourcelist.txt) would capture the content of the file sourcelist.txt in the variable SRCS, less the final newline in the file, if any (that last bit is for consistency with the $(shell) function).

Improved error reporting

GNU make 4.2 includes a small but very useful improvement in error reporting: previously when make encountered an error while executing a recipe, it would report only the name of the target being built, such as make: *** [all] Error 1. Starting with 4.2, this error message includes the makefile and line number of the specific command that produced the error: make: *** [Makefile:6: all] Error 1. This should make it much easier to debug large, complex builds — especially anything that uses double-colon rules to composite functionality from many fragments in distinct makefiles.

Bug fixes

In addition to the modest enhancements described above, the 4.2 release includes about three dozen other bug fixes. A glance at the resolution dates on those reveals that sometimes months passed with no updates. This makes me wonder why they didn’t cut a release at those points, even if it were just for bug fixes. My guess is that the project is trapped, in a sense: because the interval between releases is so long there’s a sense that each release has to be “perfect”, and because there’s an attempt to ensure each release is “perfect”, the interval between releases must be very long. Contrast this with a more agile approach, which can tolerate imperfect releases because the next release is just around the corner anyway. Combined with an ever expanding automated regression test suite it’s possible to gradually increase the bar for release quality, such that in fact the likelihood of a bad release goes down when compared with a project that has a long release cycle and is dependent on mostly manual testing.

GNU make isn’t going to go away any time soon, but I think the writing is on the wall: if it doesn’t start innovating again, developers will inevitably migrate to other build tools that do.

What’s new in GNU make 4.1?

October 2014 saw the release of GNU make 4.1. Although this release doesn’t have any really remarkable new features, the release is notable because it comes just one year after the 4.0 release — that’s the least time between releases in more than a decade. Hopefully, this is the start of a new era of more frequent, smaller releases for this venerable project which is one of the oldest still active projects in the GNU suite. Read on for notes about the new features in GNU make 4.1.

MAKE_TERMOUT and MAKE_TERMERR

Starting with 4.1, GNU make defines two additional variables: MAKE_TERMOUT and MAKE_TERMERR. These are set to non-empty values if make believes stdout/stderr is attached to a terminal (rather than a file). This enables users to solve a problem introduced by the output synchronization feature that was added in GNU make 4.0: when output synchronization is enabled, all child processes in fact write to a temporary file, even though in effect they are writing to the console. In other words, the implementation details of output synchronization may interfere with behaviors in child processes like output colorization which require a terminal for correct operation. If MAKE_TERMOUT or MAKE_TERMERR is set, then the user may explicitly direct such commands to maintain colorized output despite the fact that they appear to be writing to a file.
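
As a sketch of how a child process might use this, here is a small Python helper for a hypothetical build tool that colorizes its output. It relies on GNU make marking MAKE_TERMOUT for export to child processes when it is set, which is my reading of the GNU make manual; the want_color and colorize names and the green ANSI escape are just choices for this example.

```python
import os
import sys


def want_color(stream=sys.stdout, environ=os.environ):
    """Decide whether to emit ANSI color codes on stream."""
    if stream.isatty():
        return True  # writing straight to a terminal
    # Under output synchronization stdout is a temporary file, but make sets
    # MAKE_TERMOUT when its own stdout is a terminal, so honor that instead.
    return bool(environ.get("MAKE_TERMOUT"))


def colorize(text, enabled):
    """Wrap text in a green ANSI escape sequence when color is enabled."""
    return f"\033[32m{text}\033[0m" if enabled else text


if __name__ == "__main__":
    print(colorize("build step ok", want_color()))
```

The same check against MAKE_TERMERR would apply to anything the tool writes to stderr.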

Enhanced $(file) function

The $(file) function was added in GNU make 4.0 to enable writing to files from a makefile without having to invoke a second process to do so. For example, where previously you had to do something like $(shell echo hello > myfile), now you can instead use $(file > myfile,hello). In theory this is more efficient, since it avoids creating another process, and it enables the user to easily write large blocks of text which would exceed command-line length limitations on some platforms.

In GNU make 4.1, the $(file) function has been enhanced such that the text to be written may be omitted from the function call. This allows $(file) to work as a sort of “poor man’s” replacement for touch, although having reviewed the bug report that resulted in this change, I think this is more an “enhancement of convenience” than a deliberate attempt to evolve the program. Of course I have to give a shout out to my friend Tim Murphy, who filed the bug report that led to this enhancement — nice work, Tim!

Relaxed constraints for mixing explicit and implicit rules

The final feature change in GNU make 4.1 is that make will emit a regular error rather than a fatal error (which terminates the build) when both explicit and pattern targets are specified as outputs of a rule, like this:

foo bar%: baz

This is an interesting change mostly for the high level of drama surrounding it. That bit of syntax is clearly illegal — in fact, if the pattern target is listed first rather than the explicit, GNU make has long identified this as invalid syntax, terminating the parse with *** mixed implicit and normal rules. Stop. Unfortunately, due to a defect in older versions of GNU make this construct is not prohibited when the explicit rule is named first.

In 3.82, the GNU make maintainers fixed the defect: whether or not the explicit target is named first, GNU make would identify the invalid syntax and terminate parsing. Everything was fine for about a year, and then? People flipped out. As it turns out, this construct is used by a prominent open source project: the Linux kernel. The offending syntax had been eliminated from the main development branch shortly after the 3.82 release, but third-party developers suddenly found themselves unable to build legacy versions of the kernel with the latest release of GNU make. A bug report was filed and generated 21 responses, when the average GNU make bug report has only 3. Ultimately, the maintainers relented by reducing the severity to a non-fatal error for the 4.1 release, but with a stern message that this will likely become a fatal error again in a future release.

Bug fixes and thoughts

In addition to the bigger items identified above, the 4.1 release includes about two dozen other bug fixes. Overall, this release feels like a minor one — as often happens when release frequency increases, the individual releases become less interesting. From an agile/continuous delivery standpoint, that’s exactly what you want. But I’ve found that it is also difficult for a team that’s accustomed to less frequent releases with larger payloads to transition to smaller, more frequent releases while still incorporating large changes that take longer than one release to implement. Of course, one point does not make a line — that is, we can’t tell from this release alone whether the intention is to switch to a more frequent release cadence, or whether this release is an exception. If they are trying to increase the frequency, I think it will be very interesting to see how the GNU make development team adapts to the new cadence. Regardless, I’d like to congratulate the team for this release and I look forward to seeing what comes next.

HOWTO: Intro to GNU make variables

One thing that many GNU make users struggle with is predicting the value of a variable. And it’s no wonder, with the way make syntax freely mingles text intended for two very distinct phases of execution, and with two “flavors” of variables with very different semantics — except, that is, when the one seems to behave like the other. In this article I’ll run you through the fundamentals of GNU make variables so you can impress your friends (well, your nerdy friends, anyway) with your ability to predict the value of a GNU make variable at social gatherings.

Basics
Creating variables
Variable flavors
Target-specific variables

Basics

Let’s start with the basics: a GNU make variable is simply a shorthand reference for another string of text. Using variables enables you to create more flexible makefiles that are also easier to read and modify. To create a variable, just assign a value to a name:

CFLAGS=-g -O2

Later, when GNU make sees a reference to the variable, it will replace the reference with the value of the variable; this is called expanding the variable. A variable reference is just the variable name wrapped in parentheses or curly braces and prefixed with a dollar-sign. For example, this simple makefile will print “Hello, world!” by first assigning that text to a variable, then dereferencing the variable and using echo to print the variable’s value:

MSG = Hello, world!
all:
	@echo $(MSG)

Creating variables

NAME = value is just one of many ways to create a variable. In fact there are at least eight ways to create a variable using GNU make syntax, plus there are built-in variables, command-line declarations, and of course environment variables. Here’s a rundown of the ways to create a GNU make variable:

MYVAR = abc creates the variable MYVAR if it does not exist, or changes its value if it does. Either way, after this statement is processed, the value of MYVAR will be abc.
MYVAR ?= def will create the variable MYVAR with the value def only if MYVAR does not already exist.
MYVAR += ghi will create the variable MYVAR with the value ghi if MYVAR does not already exist, or it will append ghi to MYVAR if it does already exist.
MYVAR := jkl creates MYVAR if it does not exist, or changes its value if it does. This variation is just like the first, except that it creates a so-called simple variable, instead of a recursive variable — more on that in a minute.

In addition to the various assignment operators, you can create and modify variables using the define directive — handy if you want to create a variable with a multi-line value. Besides that, the define directive is equivalent to the normal VAR=VALUE assignment.

define MYVAR
  abc
  def
endef

If you’re using GNU make 3.82 or later, you can add assignment operators to the define directive to modify the intent. For example, to append a multi-line value to an existing variable:

define MYVAR +=
  abc
  def
endef

But there are still more ways to create variables in GNU make:

Environment variables are automatically created as GNU make variables when GNU make is invoked.
Command-line definitions enable you to create variables at the time you invoke GNU make, like this: gmake MYVAR=123.
Built-in variables are automatically created when GNU make starts. For example, GNU make defines a variable named CC which contains the name of the default C compiler (cc) and another named CXX which contains the name of the default C++ compiler (g++).

Variable flavors

Now that you know how to create a GNU make variable and how to dereference one, consider what happens when you reference a variable while creating a second variable. Let’s use a few simple exercises to set the stage. For each, the answer appears on the line following the makefile.

Q1: What will this makefile print?

ABC = Hello!
MYVAR = $(ABC)
all:
	@echo $(MYVAR)

A1: Hello!
Q2: What will this makefile print?

ABC = Hello!
MYVAR = $(ABC)
all:
	@echo $(MYVAR)

ABC = Goodbye!

A2: Goodbye!
Q3: What will this makefile print?

ABC = Hello!
MYVAR := $(ABC)
all:
	@echo $(MYVAR)

ABC = Goodbye!

A3: Hello!

Don’t feel bad if you were surprised by some of the answers! This is one of the trickiest aspects of GNU make variables. To really understand the results, you have to wrap your brain around two core GNU make concepts. The first is that there are actually two different flavors of variables in GNU make: recursive, and simple. The difference between the two is in how GNU make handles variable references on the right-hand side of the variable assignment — for brevity I’ll call these “subordinate variables”:

With simple variables, subordinate variables are expanded immediately when the assignment is processed. References to subordinate variables are replaced with the value of the subordinate variable at the moment of the assignment. Simple variables are created when you use := in the assignment.
With recursive variables, expansion of subordinate variables is deferred until the variable named on the left-hand side of the assignment is itself referenced. That leads to some funny behaviors, because the value of the subordinate variables at the time of the assignment is irrelevant — in fact, the subordinate variables may not even exist at that point! What matters is the value of the subordinate variables when the LHS variable is expanded. Recursive variables are the default flavor, and they’re created when you use simply = in the assignment.

The second concept is that GNU make processes a makefile in two separate phases, and each phase processes only part of the text of the makefile. The first phase is parsing, during which GNU make interprets all of the text of the makefile that is outside of rule bodies. During parsing, rule bodies are not interpreted, only extracted for processing during the second phase: execution, when GNU make actually starts running the commands to update targets. For purposes of this discussion, that means that the text in rule bodies is not expanded until after all the other text in the makefile has been processed, including variable assignments that physically appear after the rule bodies. In the following makefile, the recipe line (@echo $(MYVAR)) is the only text processed during execution; every other line, including the final ABC = Goodbye!, is processed earlier, during parsing. Again, to put a fine point on it: all of the parse-time text is processed before any of the recipe text:

ABC = Hello!
MYVAR = $(ABC)
all:
	@echo $(MYVAR)
 
ABC = Goodbye!

Now the examples above should make sense. In Question 2, we created MYVAR as a recursive variable, which means the value of ABC at the time MYVAR is created doesn’t matter. By the time GNU make needs to expand MYVAR, the value of ABC has changed, so that’s what we see in the output.

In Question 3, we created MYVAR as a simple variable, so the value of ABC was captured immediately. Even though the value of ABC changes later, that change doesn’t affect the value of MYVAR.
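
If it helps to see the two flavors side by side, here is a toy model in Python. It is only a sketch of the behavior described above, not how GNU make is actually implemented: the MiniMake class and its method names are made up for illustration, and it understands nothing but $(NAME) references.

```python
import re

class MiniMake:
    """Toy model of recursive vs. simple variables (hypothetical, not GNU make's code)."""

    def __init__(self):
        self.variables = {}  # name -> (is_simple, text)

    def expand(self, text):
        # Replace every $(NAME) with the current value of NAME.
        return re.sub(r"\$\((\w+)\)", lambda m: self.value(m.group(1)), text)

    def value(self, name):
        is_simple, text = self.variables.get(name, (True, ""))
        # Simple variables hold finished text; recursive ones are expanded on every use.
        return text if is_simple else self.expand(text)

    def assign(self, name, text, simple=False):
        # ":=" expands the right-hand side now; "=" stores it unexpanded.
        if simple:
            text = self.expand(text)
        self.variables[name] = (simple, text)


def run(simple):
    mk = MiniMake()
    # Parsing phase: every assignment is processed, including the last one.
    mk.assign("ABC", "Hello!")
    mk.assign("MYVAR", "$(ABC)", simple=simple)
    mk.assign("ABC", "Goodbye!")
    # Execution phase: only now is the recipe text expanded.
    return mk.expand("$(MYVAR)")

print(run(simple=False))  # Question 2: prints Goodbye!
print(run(simple=True))   # Question 3: prints Hello!
```

The only difference between the two runs is when the right-hand side of MYVAR gets expanded, which is exactly the difference between = and :=.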

Target-specific variables

Most variables in GNU make are global: that is, they are shared across all targets in the makefile and expanded the same way for all targets, subject to the rules outlined above. But GNU make also supports target-specific variables: variables given distinct values that are only used when expanding the recipe for a specific target (or its prerequisites).

Syntactically, target-specific variables look like a mashup of normal variable definitions, using =, :=, etc.; and prerequisite declarations. For example, foo: ABC = 123 creates a target-specific definition of ABC for the target foo. Even if ABC has already been defined as a global variable with a different value, this target-specific definition will take precedence when expanding the recipe for foo. Consider this simple makefile:

ABC = Hello!
all: foo bar
foo:
	@echo $(ABC)
 
bar: ABC = Goodbye!
bar: 
	@echo $(ABC)

At first glance you might expect this makefile to print “Goodbye!” twice — after all, ABC is redefined with the value “Goodbye!” before the commands for foo are expanded. But because the redefinition is target-specific, it only applies to bar. Thus, this makefile will print one “Hello!” and one “Goodbye!”.

As noted, target-specific variables are inherited from a target to its prereqs — for example, the following makefile will print “Surprise!”, because bar inherits the target-specific value for ABC from foo:

ABC = Normal.
foo: ABC = Surprise!
foo: bar
 
bar: 
	@echo $(ABC)

You can do some neat tricks with this, but I urge you not to rely on the behavior, because it doesn’t always work the way you might think. In particular, if a target is listed as a prereq for multiple other targets, each of which has a different target-specific value for some variable, the actual value used for the prereq may vary depending on which files were out-of-date and the execution order of the targets. As a quick example, compare the output of the previous makefile when invoked with gmake foo and when invoked with gmake bar. In the latter case, the target-specific value from foo is never applied, because foo itself was not processed. With GNU make 3.82 or later, you can prevent this inheritance by using the private modifier, as in foo: private ABC = Surprise!.

Finally, note that target-specific variables may be applied to patterns. For example, a line reading %.o: ABC=123 creates a target-specific variable for all targets matching the pattern %.o.
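
To make the inheritance rule a bit more concrete, here is a rough Python model of the foo/bar makefile from above. Treat it as a sketch under simplifying assumptions: the dictionaries and the build function are invented for illustration, and real GNU make does a good deal more (it builds each target only once, checks timestamps, and so on).

```python
GLOBALS = {"ABC": "Normal."}
TARGET_VARS = {"foo": {"ABC": "Surprise!"}}   # foo: ABC = Surprise!
PREREQS = {"foo": ["bar"], "bar": []}         # foo: bar

def build(goal, inherited=None, output=None):
    """Walk the prerequisites of goal; return what bar's recipe would echo."""
    output = [] if output is None else output
    # Start from whatever the parent target passed down, then add our own settings.
    scope = dict(inherited or {})
    scope.update(TARGET_VARS.get(goal, {}))
    for prereq in PREREQS[goal]:
        build(prereq, scope, output)
    if goal == "bar":                          # bar's recipe: @echo $(ABC)
        output.append(scope.get("ABC", GLOBALS["ABC"]))
    return output

print(build("foo"))  # bar is reached through foo: ['Surprise!']
print(build("bar"))  # bar is requested directly: ['Normal.']
```

The value bar sees depends entirely on the path taken to reach it, which is why relying on inheritance gets risky once a target has more than one parent.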

Conclusion

If you’ve made it this far, you now know just about everything there is to know about GNU make variables. Congratulations! I hope this information will serve you well.

Questions or comments? Use the form below or hit me up on Twitter @emelski.

The Twelve Days of Christmas, GNU make style

Well, it’s Christmas Day in the States today, and while we’re all recovering from the gift-opening festivities, I thought this would be the perfect time for a bit of fun with GNU make. And what better subject matter than the classic Christmas carol “The Twelve Days of Christmas”? Its repetitive structure is perfect for demonstrating how to use several of GNU make’s built-in functions for iteration, selection and sorting. This simple makefile prints the complete lyrics to the song:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

L01=Twelve drummers drumming,
L02=Eleven pipers piping,
L03=Ten lords-a-leaping,
L04=Nine ladies dancing,
L05=Eight maids-a-milking,
L06=Seven swans-a-swimming,
L07=Six geese-a-laying,
L08=Five golden rings,
L09=Four calling birds,
L10=Three french hens,
L11=Two turtle doves, and
L12=A partridge in a pear tree!
LINES=12 11 10 09 08 07 06 05 04 03 02 01
DAYS=twelfth eleventh tenth ninth \
	eighth seventh sixth fifth \
	fourth third second first
 
$(foreach n,$(LINES),\
	$(if $(X),$(info ),$(eval X=X))\
	$(info On the $(word $n,$(DAYS)) day of Christmas,)\
	$(info my true love gave to me)\
	$(foreach line,$(wordlist $n,12,$(sort $(LINES))),\
		$(info $(L$(line)))))
 
all: ; @:

By count, most of the lines here just declare variables, one for each item mentioned in the song. Note how the items are ordered: the last item added is given the lowest index. That means that to construct each verse we simply enumerate every item in the list, in order, starting with the new item in each verse.

Line 18 is where the real meat of the makefile begins. Here we use GNU make’s foreach function to iterate through the verses. $(foreach) takes three arguments: a name for the iteration variable, a space-separated list of words to assign to the iteration variable in turn, and a body of text to expand repeatedly, once for each word in the list. Here, the list of words is given by LINES, which lists the starting line for each verse, in order — that is, the first verse starts from line 12, the second from line 11, etc. The text to expand on each iteration is all the text on lines 19-23 of the makefile — note the use of backslashes to continue each line to the next.

Line 19 uses several functions to print a blank line before starting the next verse, if we’ve printed a verse already: the $(if) function, which expands its second argument if its first argument is non-empty, and its third argument if its first argument is empty; the $(info) function to print a blank line; and the $(eval) function to set the flag variable. The first time this line is expanded, X does not exist, so it expands to an empty string and the $(if) picks the “else” branch. After that, X has a value, so the $(if) picks the “then” branch.

Lines 20 and 21 again use $(info) to print output — this time the prelude for the verse, like “On the first day of Christmas, my true love gave to me”. The ordinal for each day is pulled from DAYS using the $(word) function, which extracts a specified word, given by its first argument, from the space-separated list given as its second argument. Here we’re using n, the iteration variable from our initial $(foreach) as the selector for $(word).

Line 22 uses $(foreach) again, this time to iterate through the lines in the current verse. We use line as the iteration variable. The list of words is given again by LINES except now we’re using $(sort) to reverse the order, and $(wordlist) to select a subset of the lines. $(wordlist) takes three arguments: the index of the first word in the list to select, the index of the last word to select, and a space-separated list of words to select from. The indices are one-based, not zero-based, and $(wordlist) returns all the words in the given range. The body of this $(foreach) is just line 23, which uses $(info) once more to print the current line of the current verse.
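
If the one-based indexing of $(word) and $(wordlist) is hard to picture, this small Python sketch mirrors the selection logic. The helper functions are my own stand-ins for the GNU make built-ins, written just for this illustration; note that Python's sorted() does not remove duplicates the way $(sort) does, which makes no difference here.

```python
def word(n, words):
    """Like $(word n,text): pick a single word, counting from 1."""
    return words[n - 1]

def wordlist(start, end, words):
    """Like $(wordlist s,e,text): words start..end inclusive, counting from 1."""
    return words[start - 1:end]

LINES = "12 11 10 09 08 07 06 05 04 03 02 01".split()
DAYS = ("twelfth eleventh tenth ninth eighth seventh "
        "sixth fifth fourth third second first").split()

def verse_lines(n):
    """The L-variable suffixes printed for the verse that starts at line n."""
    return wordlist(int(n), 12, sorted(LINES))

for n in LINES[:3]:
    print(word(int(n), DAYS), verse_lines(n))
```

The first three iterations print first ['12'], then second ['11', '12'], then third ['10', '11', '12']: each verse starts one line earlier and always runs through line 12.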

Line 25 has the last bit of funny business in this makefile. We have to include a make rule in the makefile, or GNU make will complain *** No targets. Stop. after printing the lyrics. If we simply declare a rule with no commands, like all:, GNU make will complain Nothing to be done for `all'.. Therefore, we define a rule with a single “no-op” command that uses the shell built-in “:” to do nothing, combined with GNU make’s @ prefix to suppress printing the command itself.

And that’s it! Now you’ve got some experience with several of the built-in functions in GNU make — not bad for a Christmas day lark:

$(eval) for dynamic interpretation of text as makefile content
$(foreach), for iteration
$(if), for conditional expansion
$(info), for printing output
$(sort), for sorting a list
$(word), for selecting a single word from a list
$(wordlist), for selecting a range of words from a list

Now — where’s that figgy pudding? Merry Christmas!

UPDATE: SCons is Still Really Slow

A while back I posted a series of articles exploring the scalability of SCons, a popular Python-based build tool. In a nutshell, my experiments showed that SCons exhibits roughly quadratic growth in build runtimes as the number of targets increases:

Recently Dirk Baechle attempted to rebut my findings in an entry on the SCons wiki: Why SCons is not slow. I thought Dirk made some credible suggestions that could explain my results, and he did some smart things in his effort to invalidate my results. Unfortunately, his methods were flawed and his conclusions are invalid. My original results still stand: SCons really is slow. In the sections that follow I’ll share my own updated benchmarks and show where Dirk’s analysis went wrong.

Test setup

As before, I used genscons.pl to generate sample builds ranging from 2,000 to 50,000 targets. However, my test system was much beefier this time:

	2013	2010
OS	Linux Mint 14 (kernel version 3.5.0-17-generic)	RedHat Desktop 3 (kernel version 2.4.21-58.ELsmp)
CPU	Quad 1.7GHz Intel Core i7, hyperthreaded	Dual 2.4GHz Intel Xeon, hyperthreaded
RAM	16 GB	2 GB
HD	SSD	(unknown)
SCons	2.3.0	1.2.0.r3842
Python	2.7.3 (system default)	2.6.2

Before running the tests, I rebooted the system to ensure there were no rogue processes consuming memory or CPU. I also forced the CPU cores into “performance” mode to ensure that they ran at their full 1.7GHz speed, rather than at the lower 933MHz they switch to when idle.

Revisiting the original benchmark

I think Dirk had two credible theories to explain the results I obtained in my original tests. First, Dirk wondered if those results may have been the result of virtual memory swapping — my original test system had relatively little RAM, and SCons itself uses a lot of memory. It’s plausible that physical memory was exhausted, forcing the OS to swap memory to disk. As Dirk said, “this would explain the increase of build times” — you bet it would! I don’t remember seeing any indication of memory swapping when I ran these tests originally, but to be honest it was nearly 4 years ago and perhaps my memory is not reliable. To eliminate this possibility, I ran the tests on a system with 16 GB RAM this time. During the tests I ran vmstat 5, which collects memory and swap usage information at five second intervals, and captured the result in a log.

Next, he suggested that I skewed the results by directing SCons to inherit the ambient environment, rather than using SCons’ default “sanitized” environment. That is, he felt I should have used env = Environment() rather than env = Environment(ENV = os.environ). To ensure that this was not a factor, I modified the tests so that they did not inherit the environment. At the same time, I substituted echo for the compiler and other commands, in order to make the tests faster. Besides, I’m not interested in benchmarking the compiler — just SCons! Here’s what my Environment declaration looks like now:

env = Environment(CC = 'echo', AR = 'echo', RANLIB = 'echo')

With these changes in place I reran my benchmarks. As expected, there was no change in the outcome. There is no doubt: SCons does not scale linearly. Instead the growth is polynomial, following an n^1.85 curve. And thanks to the vmstat output we can be certain that there was absolutely no swapping affecting the benchmarks. Here’s a graph of the results, including an n^1.85 curve for comparison. Notice that you can barely see that curve because it matches the observed data so well!
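
For anyone who wants to check an exponent like this against their own measurements, one common approach is a least-squares fit on a log-log scale. The sketch below is a generic illustration of that technique, not necessarily the exact procedure I used, and the data in it is synthetic: it is generated from a known power law, not taken from my benchmark results.

```python
import math

def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = log(c) + k * log(size); returns (c, k)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    k = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = math.exp(mean_y - k * mean_x)
    return c, k

# Synthetic data from t = 1e-5 * n^1.85. These are NOT measured build times.
sizes = [2000, 5000, 10000, 20000, 50000]
times = [1e-5 * n ** 1.85 for n in sizes]

c, k = fit_power_law(sizes, times)
print(f"estimated exponent: {k:.2f}")
```

An exponent near 1 would indicate linear scaling; anything well above 1 means each additional target costs more than the last.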

For comparison, I used the SCons build log to make a shell script that executes the same series of echo commands. At 50,000 targets, the shell script ran in 1.097s. You read that right: 1.097s. Granted, the shell script doesn’t do stuff like up-to-date checks, etc., but still — of the 3,759s average SCons runtime, 3,758s — 99.97% — is SCons overhead.

I also created a non-recursive Makefile that “builds” the same targets with the same echo commands. This is a more realistic comparison to SCons — after all, nobody would dream of actually controlling a build with a straight-line shell script, but lots of people would use GNU make to do it. With 50,000 targets, GNU make ran for 82.469s — more than 45 times faster than SCons.

What is linear scaling?

If the performance problems are so obvious, why did Dirk fail to see them? Here’s a graph made from his test results:

Dirk says that this demonstrates “SCons’ linear scaling”. I find this statement baffling, because his data clearly shows that SCons does not scale linearly. It’s simple, really: linear scaling just means that the build time increases by the same amount for each new target you add, regardless of how many targets you already have. Put another way, it means that the difference in build time between 1,000 targets and 2,000 targets is exactly the same as the difference between 10,000 and 11,000 targets, or between 30,000 and 31,000 targets. Or, put yet another way, it means that when you plot the build time versus the number of targets, you should get a straight line with no change in slope at any point. Now you tell me: does that describe Dirk’s graph?
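
That definition is easy to turn into a mechanical check: compute the cost per added target over each interval and see whether it stays constant. Here is a minimal Python sketch; the 5% tolerance is an arbitrary choice of mine, and the two data sets are made up to show the contrast, not taken from either benchmark.

```python
def slopes(points):
    """Seconds per added target between consecutive (targets, seconds) points."""
    return [(t2 - t1) / (n2 - n1)
            for (n1, t1), (n2, t2) in zip(points, points[1:])]

def is_linear(points, tolerance=0.05):
    """True if every interval's slope is within tolerance (relative) of the first."""
    s = slopes(points)
    return all(abs(x - s[0]) <= tolerance * abs(s[0]) for x in s)

# Made-up illustrative numbers, not measurements.
counts = (1000, 2000, 10000, 11000, 30000, 31000)
linear = [(n, 0.01 * n + 3) for n in counts]
quadratic = [(n, 1e-6 * n * n) for n in counts]

print(is_linear(linear), is_linear(quadratic))  # True False
```

For the linear data, going from 1,000 to 2,000 targets costs the same as going from 30,000 to 31,000. For the quadratic data the later step costs roughly twenty times as much.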

Here’s another version of that graph, this time augmented with a couple additional lines that show what the plot would look like if SCons were truly scaling linearly. The first projection is based on the original graph from 2,500 to 4,500 targets — that is, if we assume that SCons scales linearly and that the increase in build time between 2,500 and 4,500 targets is representative of the cost to add 2,000 more targets, then this line shows us how we should expect the build time to increase. Similarly, the second projection is based on the original graph between 4,500 and 8,500 targets. You can easily see that the actual data does not match either projection. Furthermore you can see that the slope of these projections is increasing:

This shows the importance of testing at large scale when you’re trying to characterize the scalability of a system from empirical data. It can be difficult to differentiate polynomial from logarithmic or linear at low scales, especially once you incorporate the constant factors — polynomial algorithms can sometimes even give better absolute performance for small inputs than linear algorithms! It’s not until you plot enough data points at large enough values, as I’ve done, that it becomes easy to see and identify the curve.

What does profiling tell us?

Next, Dirk reran some of his tests under a profiler, on the very reasonable assumption that if there was a performance problem to be found, it would manifest in the profiling data: surely at least one function would demonstrate a larger-than-expected growth in runtime. Dirk only shared profiling data for two runs, both incremental builds, at 8,500 and 16,500 targets. That’s unfortunate for a couple of reasons. First, the performance problem is less apparent on incremental builds than on full builds. Second, with only two datapoints it is literally not possible to determine whether growth is linear or polynomial. The results of Dirk’s profiling were negative: he found no “significant difference or increase” in any function.

Fortunately it’s easy to run this experiment myself. Dirk used cProfile, which is built into Python. To profile a Python script you can inject cProfile from the command-line, like this: python -m cProfile scons. Just before Python exits, cProfile dumps timing data for every function invoked during the run. I ran several full builds with the profiler enabled, from 2,000 to 20,000 targets. Then I sorted the profiling data by function internal time (time spent in the function exclusively, not in its descendants). In every run, the same two functions appeared at the top of the list: posix.waitpid and posix.fork. To be honest this was a surprise to me, since previously I believed the problem was in SCons’ Taskmaster implementation. But I can’t really argue with the data. It makes sense that SCons would spend most of its time running and waiting for child processes to execute, and even that the amount of time spent in these functions would increase as the number of child processes increases. But look at the growth in runtimes in these two functions:

Like the overall build time, these curves are obviously non-linear. Armed with this knowledge, I went back to Dirk’s profiling data. To my surprise, posix.waitpid and posix.fork don’t even appear in Dirk’s data. On closer inspection, his data seems to include only a subset of all functions — about 600 functions, whereas my profiling data contains more than 1,500. I cannot explain this — perhaps Dirk filtered the results to exclude functions that are part of the Python library, assuming that the problem must be in SCons’ own code rather than in the library on which it is built.

This demonstrates a second fundamental principle of performance analysis: make sure that you consider all the data. Programmers’ intuition about performance problems is notoriously bad — even mine! — which is why it’s important to measure before acting. But measuring won’t help if you’re missing critical data or if you discard part of the data before doing any analysis.
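
If you want to try the same kind of analysis on your own code, the sketch below shows the general shape: profile a run with cProfile, then sort by internal time without filtering anything out. The workload here is a trivial stand-in that I made up (I profiled SCons itself, not this function), so only the cProfile and pstats calls carry over.

```python
import cProfile
import io
import pstats

def busy(n):
    """Stand-in workload, purely for illustration."""
    return sum(i * i for i in range(n))

def workload():
    for _ in range(50):
        busy(10_000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
# "tottime" is internal time: time spent in the function itself, excluding callees.
stats.sort_stats("tottime").print_stats(5)
print(report.getvalue())
```

From the command line, python -m cProfile -s tottime followed by the script name gives the same ordering. Either way, look at the whole list, library functions included, before deciding where the time goes.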

Conclusions

On the surface, performance analysis seems like it should be simple: start a timer, run some code, stop the timer. Done correctly, performance analysis can illuminate the dark corners of your application’s performance. Done incorrectly — and there are many ways to do it incorrectly — it can lead you on a wild goose chase and cause you to squander resources fixing the wrong problems.

Dirk Baechle had good intentions when he set out to analyze SCons performance, but he made some mistakes in his process that led him to an erroneous conclusion. First, he didn’t run enough large-scale tests to really see the performance problem. Second, he filtered his experimental data in a way that obscured the existence of the problem. But perhaps his worst mistake was to start with a conclusion — that there is no performance problem — and then look for data to support it, rather than starting with the data and letting it impartially guide him to an evidence-based conclusion.

To me the evidence seems indisputable: SCons exhibits roughly quadratic growth in runtimes as the number of build targets increases, rendering it unusable for large-scale software development (tens of thousands of build outputs). There is no evidence that this is a result of virtual memory swapping. Profiling suggests a possible pair of culprits in posix.waitpid and posix.fork. I leave it to Dirk and the SCons team to investigate further; in the meantime, you can find my test harness and test results in my GitHub repo. If you can see a flaw in my methodology, sound off in the comments!

What’s new in GNU make 4.0?

After a little bit more than three years, the 4.0 release of GNU make finally arrived in October. This release packs in a bunch of improvements across many functional areas including debuggability and extensibility. Here’s my take on the most interesting new features.

Output synchronization

For the majority of users the most exciting new feature is output synchronization. When enabled, output synchronization ensures that the output of each job is kept distinct, even when the build is run in parallel. This is a tremendous boon to anybody that’s had the misfortune of having to diagnose a failure in a parallel build. This simple Makefile will help demonstrate the feature:

all: a b c
a:
	@echo COMPILE a
	@sleep 1 && echo a, part 1
	@sleep 1 && echo a, part 2
	@sleep 2 && echo a, part 3

b c:
	@echo COMPILE $@
	@sleep 1 && echo $@, part 1
	@sleep 1 && echo $@, part 2
	@sleep 1 && echo $@, part 3

Now compare the output when run serially, when run in parallel, and when run in parallel with --output-sync=target (shown below in that order):

$ gmake
COMPILE a
a, part 1
a, part 2
a, part 3
COMPILE b
b, part 1
b, part 2
b, part 3
COMPILE c
c, part 1
c, part 2
c, part 3

$ gmake -j 4
COMPILE a
COMPILE b
COMPILE c
b, part 1
a, part 1
c, part 1
b, part 2
a, part 2
c, part 2
b, part 3
c, part 3
a, part 3

$ gmake -j 4 --output-sync=target
COMPILE c
c, part 1
c, part 2
c, part 3
COMPILE b
b, part 1
b, part 2
b, part 3
COMPILE a
a, part 1
a, part 2
a, part 3

Here you see the classic problem with parallel gmake build output logs: the output from each target is mixed up with the output from other targets. With output synchronization, the output from each target is kept separate, not intermingled. Slick! The output doesn’t match that of the serial build, unfortunately, but this is still a huge step forward in usability.

The provenance of this feature is especially interesting, because the idea can be traced directly back to me: in 2009, I wrote an article for CM Crossroads called Descrambling Parallel Build Logs. That article inspired David Boyce to submit a patch to GNU make in 2011 which was the first iteration of the --output-sync feature.

GNU Guile integration

The next major addition in GNU make 4.0 is GNU Guile integration, which makes it possible to invoke Guile code directly from within a makefile, via a new $(guile) built-in function. Naturally, since Guile is a general-purpose, high-level programming language, this allows for far more sophisticated computation from directly within your makefiles. Here’s an example that uses Guile to compute Fibonacci numbers — contrast with my Fibonacci in pure GNU make:

define FIBDEF
(define (fibonacci x)
 (if (< x 2)
  x
  (+ (fibonacci (- x 1)) (fibonacci (- x 2)))))
#f
endef
$(guile $(FIBDEF))

%:
	@echo $(guile (fibonacci $@))

Obviously, having a more expressive programming language available in makefiles will make it possible to do a great deal more with your make-based builds than ever before. Unfortunately I think the GNU make maintainers made a couple mistakes with this feature which will limit its use in practice. First, Guile was a poor choice. Although it’s a perfectly capable programming language, it’s not well-known or in wide use compared to other languages that they might have chosen — although you can find Scheme on the TIOBE Index, Guile itself doesn’t show up, and even though it is the official extension language of the GNU project, fewer than 25 of the GNU project’s 350 packages use Guile. If the intent was to embed a language that would be usable by a large number of developers, Python seems like the no-brainer option. Barring that for any reason, Lua seems to be the de facto standard for embedded programming languages thanks to its small footprint and short learning curve. Guile is just some weird also-ran.

Second, the make/Guile integration seems a bit rough. The difficulty arises from the fact that Guile has a rich type system, while make does not: everything in make is a string. Consequently, to return values from Guile code to make they must be converted to a string representation. For many data types (numbers, symbols and of course strings themselves) the conversion is obvious, and reversible. But for some data types, this integration does a lossy conversion which makes it impossible to recover the original value. Specifically, the Guile value for false, #f, is converted to an empty string, rendering it indistinguishable from an actual empty string return value. In addition, nested lists are flattened, so that (a b (c d) e) becomes a b c d e. Of course, depending on how you intend to use the data, each of these may be the right conversion. But that choice should be left to the user, so that we can retain the additional information if desired.

Loadable objects

The last big new feature in GNU make 4.0 is the ability to dynamically load binary objects into GNU make at runtime. In a nutshell, that load of jargon means that it’s possible for you to add your own “built-in” functions to GNU make, without having to modify and recompile GNU make itself. For example, you might implement an $(md5sum) function to compute a checksum, rather than using $(shell md5sum). Since these functions are written in C/C++ they should have excellent performance, and of course they can access the full spectrum of system facilities — file I/O, sockets, pipes, even other third-party libraries. Here’s a simple extension that creates a $(fibonacci) built-in function:

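#include <stdio.h>
#include <stdlib.h>
#include <gnumake.h>

/* GNU make refuses to load an object that does not define this symbol. */
int plugin_is_GPL_compatible;

/* Naive recursive Fibonacci. */
int fibonacci(int n)
{
    if (n < 2) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

/* Implements $(fibonacci N); the result must be allocated with gmk_alloc. */
char *gm_fibonacci(const char *nm, unsigned int argc, char **argv)
{
    char *buf  = gmk_alloc(33);
    snprintf(buf, 32, "%d", fibonacci(atoi(argv[0])));
    return buf;
}

/* Called by GNU make when the object is loaded; registers the function. */
int fibonacci_gmk_setup ()
{
    gmk_add_function ("fibonacci", gm_fibonacci, 1, 1, 0);
    return 1;
}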

And here’s how you would use it in a makefile:

load ./fibonacci.so
%:
	@echo $(fibonacci $@)

I’m really excited about this feature. People have been asking for additional built-in functions for years — to handle arithmetic, file I/O, and other tasks — but for whatever reason the maintainers have been slow to respond. In theory, loadable modules will enable people to expand the set of built-in functions without requiring the approval or involvement of the core team. That’s great! I only wish that the maintainers had been more responsive when we invited them to collaborate on the design, so we might have come up with a design that would work with both GNU make and Electric Make, so that extension authors need only write one version of their code. Ah well — que sera, sera.

Other features

In addition to the major features described above there are several other enhancements worth mentioning here:

::= assignment, equivalent to := assignment, added for POSIX compatibility.
!= assignment, which is basically a substitute for $(shell), added for BSD compatibility.
--trace command-line option, which causes GNU make to print commands before execution, even if they would normally be suppressed by the @ prefix.
$(file …) built-in function, for writing text to a file.
GNU make development migrated from CVS to git.

You can find the full list of updates in the NEWS file in the GNU make source tree.

Looking ahead

It’s great to see continued innovation in GNU make. Remember, this is a tool that’s now 25 years old. How much of the software you wrote 25 years ago is still in use and still in active development? I’d like to offer a heartfelt congratulations to Paul Smith and the rest of the GNU make team for their accomplishments. I look forward to seeing what comes next!

The inverted parallel build bug

At some point most of you have encountered “the” parallel build problem: a build that works just fine when run serially, but breaks sometimes when run in parallel. You may have read my blog about how ElectricAccelerator automatically solves the classic parallel build problem. Recently I ran into the opposite problem in a customer’s build: a build that “works” when run in parallel, but breaks when run serially! If you’re lucky, this build defect will just cause occasional build failures. If you’re unlucky, it will silently corrupt your build output at random. With traditional GNU make this nasty bug is a nightmare to track down, if you even know that it’s present!

In contrast, the unique features in ElectricAccelerator make it trivial to find the defect — some might even say it’s fun (well, if you’re like me and you enjoy using powerful tools to do sophisticated analysis without breaking a sweat!). Read on to see how ElectricAccelerator makes it easy to diagnose and fix bugs in your build.

The inverted parallel build bug

Let’s start with a concrete example. Here’s a simple Makefile which appears to work when run in parallel, but which consistently fails serially:

all: reader writer
	
reader:
	sleep 2
	cat output
	
writer:
	echo PASS > output

Assuming that output does not exist, executing this makefile serially will always produce an error:

$ gmake
sleep 2
cat output
cat: output: No such file or directory
gmake: *** [reader] Error 1

But if you execute this makefile in parallel, it appears to work:

$ gmake -j 2
sleep 2
echo PASS > output
cat output
PASS

If we visualize the execution of these commands it’s easy to see why the parallel build seems to work:

At the beginning of the build, both reader and writer are started, more-or-less at the same time, because we told gmake to run two jobs at a time. reader has two commands, which are executed serially according to the semantics of make. While the sleep 2 is executing, the echo command in writer runs and completes. When the cat command in reader starts, it succeeds because output is ready-to-go.

Parallel execution is no guarantee

Some people will look at that explanation and think “Got it — always run this thing in parallel and we’re good!” Of course, you can’t really be 100% sure that everybody will remember to run the makefile in parallel. But even if you could, there’s a flaw in that reasoning: basically, your build has a race condition, and there’s no guarantee that you’ll “win” the race every time. For example, if your build server is heavily loaded, the sequence of events might look like this instead:

Here, writer doesn’t get started until after the sleep command has finished — too late to save the cat command from failure.

Build failure is not the worst outcome

Before we move on to finding and fixing problems like this, let’s take a quick look at one more failure mode: incremental builds. In particular, check out what happens if output exists before the build starts, but with incorrect content (for example, stale data from an earlier build):

$ echo '*** FAIL ***' > output
$ gmake
sleep 2
cat output
*** FAIL ***
echo PASS > output
$ echo $?
0

That’s right — the build “succeeded”, because it produced no error messages and exited with a zero exit code. And yet, it produced completely bogus output. Ouch!

Somebody save me!

If you’re using ordinary GNU make, you’re in for a world of hurt with a problem like this. First, the only way to consistently reproduce the problem is to run the entire build serially — of course that probably takes a long time, or you wouldn’t have been using parallel builds in the first place. Second, there are no diagnostics built into gmake that could help you identify which job produces output. One option is to use strace to monitor filesystem accesses, but that will generate a mountain of data in a not-very-usable format. Plus, it imposes a substantial performance penalty — on top of the hit you’d already take for running the build serially. Yuck!

If you’re using Electric Make, this problem is embarrassingly easy to solve thanks to emake’s core features:

Consistent results: emake mimics serial execution with gmake, so you’ll always get a consistent result with this build. That means it will fail, the same way, every time, which means you’ll discover the problem immediately after it is introduced, not months or years later after it has become nearly impossible to tell which Makefile change introduced the defect.
Parallel speed: emake’s results match those of a serial gmake build, but its performance is more like that of a parallel gmake build — better, in most cases.
Annotated build logs: emake can generate an XML-enhanced version of the build output log which contains a record of every file accessed by every job in the build. This annotation file can easily be mined to identify pairs of jobs where the reader precedes the writer.

You can use any general purpose XML parsing library to read annotation files, but it’s easy to use annolib, the high-performance annotation processing library we wrote to facilitate this kind of analysis. Since annolib is built into ElectricInsight, the easiest way to use it is to write the analysis as a custom Insight report. All you need to do is iterate through the files referenced in the build, looking for read operations (or, in this case, failed lookups) preceding a write operation. Here’s the code:

global anno
set instances [list]

# Iterate over the files referenced in the build...

foreach filename [$anno files] {
    set readers [list]

    # Iterate over the operations performed on the file...

    foreach tuple [$anno file operations $filename] {
        foreach {job op dummy} $tuple { break }
        if { $op == "read" || $op == "failedlookup" } {
            # If this is a read operation, note the job that did the read.

            lappend readers $job
        } elseif {$op == "create" || $op == "modify" || $op == "truncate"} {
            # If this is a write operation but earlier jobs already read
            # the file, we've found a read-before-write instance.

            if { [llength $readers] } {
                lappend instances [list $readers $job $filename]
            }

            # After we see a write on this file we can move on to the next.

            break
        }
    }
}

# For each instance, print the filename, the writer, and each reader.

set result ""
foreach instance $instances {
    foreach {readers writer filename} $instance { break }
    set writerName [$anno job name $writer]
    set writerFile [$anno job makefile $writer]
    set writerLine [$anno job line $writer]
    append result "FILENAME:\n  $filename\n"
    append result "WRITER  :\n  $writerName ($writerFile:$writerLine)\n"
    append result "READERS :\n"
    foreach reader $readers {
        set readerName [$anno job name $reader]
        set readerFile [$anno job makefile $reader]
        set readerLine [$anno job line $reader]
        append result "  $readerName ($readerFile:$readerLine)\n"
    }
}

With a bit of additional boilerplate you can run this report from the command-line with Insight 4.0 (currently in limited beta). A couple of notes on usage: you should instruct emake to generate lookup-level annotation by adding --emake-annodetail=lookup to your invocation. You should also run the build with the -k (keep-going) option; otherwise, the error in reader will prevent writer from running, and emake will not record filesystem usage for it. Once you have a suitable annotation file, here’s how the report looks for this build:

$ einsight --report=ReadBeforeWrite emake.xml
done.
FILENAME:
  /home/ericm/test/output
WRITER  :
  writer (Makefile:7)
READERS :
  reader (Makefile:3)

Voila! We’ve pinpointed the problem with barely 50 lines of code (including comments!). You can even see a solution: add writer as a prerequisite of reader, on line 3 of Makefile.

Show me what you can do with ElectricAccelerator

As you’ve seen, ElectricAccelerator makes it easy to identify and correct build problems that would otherwise be nearly impossible to root out. Hopefully you also see that this is just the tip of the iceberg — with consistent fast builds and the treasure trove of data available in annotation files, what other analysis could you do? To get started, you can download a free trial of ElectricAccelerator Developer Edition and check out the reports in ElectricInsight. You can also download the Read Before Write report for ElectricInsight from my GitHub repo. If you come up with something cool, tell me about it in the comments!

What is the fastest way to find non-zero bits in an MD5 hash?

Microbenchmarks, as a general rule, are a waste of time. Let’s just get that out of the way up front. They are also, as a general rule, totally inaccurate, measuring the execution time of some snippet of code in a context that is completely divorced from the reality in which that code will actually be used. So if after reading this article you think, “I should tell Eric what a waste of time this was!” — don’t bother. I already know.

But… microbenchmarks are also fun, and sometimes interesting, and often vastly easier to implement than a real benchmark of the same code in a production system. So a couple weeks ago, when my colleague proposed using an MD5 hash with value zero as a sentinel indicating that the checksum had not yet been calculated, I wondered: what is the fastest way to test if an MD5 hash has any non-zero bits? I had some time to kill so I wrote a microbenchmark comparing several implementations. The results are presented here for your amusement and edification.

The benchmark

The goal of the benchmark is to determine which of several methods can most quickly determine whether an MD5 hash is all zeroes. An MD5 hash is 128 bits long, so in essence this problem boils down to simply checking for non-zero bits in an arbitrary sequence of 16 bytes. You can find the benchmark source code in my GitHub repo.

For sample data I simply allocated about 100,000 17-byte arrays, then set one byte in each to a non-zero value. This structure made it easy to test the effect of memory alignment on performance, by using either the first 16 or the last 16 bytes of each 17-byte array as the value under test. The total size was significant as well: smaller than the L1 cache on a typical modern CPU, so we avoid measuring memory bandwidth performance.

I tested the following methods for determining whether a hash is all zero, in both aligned and unaligned varieties:

Naive loop over bytes: the most obvious approach simply loops over the bytes, testing each in turn.
Unrolled loop over bytes: loop unrolling is a common optimization for loops with a fixed number of iterations.
Bitwise OR of bytes: OR all bytes together, then compare the result to zero.
Slice by four: treat the 16 bytes as an array of four 32-bit integers, testing each for equality with zero.
Slice by eight: treat the 16 bytes as an array of two 64-bit integers, testing each for equality with zero.
Find first set: use the GCC builtin function __builtin_ffs, which finds the first non-zero bit in a 32-bit integer.
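To make the list above a little more concrete, here is a minimal sketch in C of three of the strategies: the naive loop, the bitwise OR, and slice-by-eight. The function names and the MD5_BYTES constant are mine, invented for this illustration, and are not necessarily what the benchmark source uses. Note the assumption baked into the slice-by-eight version: the caller must hand it 8-byte-aligned storage that can legitimately be read as 64-bit integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MD5_BYTES 16  /* an MD5 hash is 128 bits */

/* Naive loop: test each byte in turn, stopping at the first non-zero one. */
int md5_is_zero_naive(const unsigned char *hash)
{
    assert(hash != NULL);
    for (size_t i = 0; i < MD5_BYTES; i++) {
        if (hash[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* Bitwise OR: fold all 16 bytes into one value, then compare once. */
int md5_is_zero_or(const unsigned char *hash)
{
    unsigned char acc = 0;
    assert(hash != NULL);
    for (size_t i = 0; i < MD5_BYTES; i++) {
        acc |= hash[i];
    }
    return acc == 0;
}

/* Slice by eight: view the hash as two 64-bit integers.
 * Assumption: the caller guarantees 8-byte alignment and that the storage
 * may be read as uint64_t (for example, it was declared that way). */
int md5_is_zero_slice8(const unsigned char *hash)
{
    const uint64_t *words = (const uint64_t *)hash;
    assert(hash != NULL);
    return words[0] == 0 && words[1] == 0;
}
```

On properly aligned data all three functions return the same answer; they differ only in how many loads and comparisons it takes to get there.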

The code was compiled to 64-bit binaries using GCC 4.7.2 with -O2 optimization and no debugging symbols.

The results

I ran the benchmarks on two systems. First I tried a quad-core hyperthreaded 1.73GHz Intel Core i7 laptop with 16GB RAM and an SSD hard drive. All cores were put into performance mode to ensure no CPU frequency scaling was enabled, using (for example) echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor. Here are the results (longer is better):

The results are a bit erratic, to be sure — for example, it makes no sense that the unaligned version of the unrolled naive loop should be faster than the aligned version. This is likely just because the operation being measured is so fast that it’s hard to get a “pure” measurement: even tiny fluctuations in system load perturb the tests. You’ll see that if you run the benchmark a few times, you’ll get slightly different results each time. That just means that we shouldn’t make any hard-and-fast decisions based on the exact numbers here.

Nevertheless, the difference between slice-by-four or slice-by-eight and the other strategies is substantial enough that I trust the overall result, if not the exact numbers. That is, slice-by-four and slice-by-eight are clearly significantly faster than any other approach. But — and here’s where we discover the degree of yak shaving we’ve been up to — even the slowest strategy is still pretty damn fast. In all honesty, it is not going to make a lick of difference in overall application performance, unless you really do need to do billions of these checks. A realistic upper bound for my application is maybe ten million, which would consume a tiny fraction of a second using even the naive loop.

One final surprise in this data is the difference in performance between aligned and unaligned memory access, or rather the lack thereof. Conventional wisdom is that you pay a performance penalty for accessing unaligned memory, at least when you try to treat it as 32- or 64-bit blocks. In fact, this result supports other tests which indicate that on the Intel Core i7 there is effectively no penalty for unaligned memory access.

If you’re only working on x86 architectures, you may consider this exercise concluded. But we actually run our software on SPARC architecture as well, so before committing to an implementation let’s take a look at how the benchmark behaves there. This time I used a 1GHz SPARCv9 CPU:

Slice-by-four and slice-by-eight are fastest here too, as long as the data is aligned. If not — BOOM! The application crashes with a bus error, because the SPARC architecture is actually quite sensitive to data alignment. If you want to treat a piece of memory as an integer, it had better be properly aligned.
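One way to sidestep the alignment requirement, which is not among the strategies measured above, is to copy the bytes into properly aligned integers with memcpy before comparing them. The sketch below makes no assumption about the alignment of its input. The function name is hypothetical, and whether the copy costs anything measurable will depend on the compiler and the platform, so treat this as an idea to test rather than a recommendation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Alignment-safe variant of slice-by-eight (hypothetical name).
 * memcpy moves the 16 bytes into aligned local integers, so the function
 * never dereferences a misaligned 64-bit pointer. */
int md5_is_zero_memcpy(const unsigned char *hash)
{
    uint64_t words[2];
    assert(hash != NULL);
    memcpy(words, hash, sizeof words);
    return (words[0] | words[1]) == 0;
}
```

Because the function only ever reads the input one byte at a time through memcpy, it can be called on either the first 16 or the last 16 bytes of a 17-byte array without risking a bus error.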

Conclusion

Informed by these results, we opted to use the slice-by-four strategy. That required a modification of our code, which previously did not guarantee alignment of MD5 hashes. Fortunately that modification was trivial, so it cost us little time and did not make the code any less clear. But you can see hints of the real danger of microbenchmarks: it’s often difficult for a developer to ignore the existence of a faster-but-more-complex strategy, despite evidence that the simple implementation is more than adequately performant. In this case the cost of enabling the faster implementation was negligible, but I’ve seen developers (including myself) needlessly contort code in the name of performance, doggedly defending their choices with microbenchmarks like these. Don’t let yourself become another statistic: use microbenchmarks, sure, but always evaluate the results in the larger context of overall application performance.

With that, I invite you to sound off in the comments: what did I overlook in my microbenchmark? How were the tests flawed? What other strategies do you know for testing whether a series of 16 bytes contains any non-zero bits?
