Parallel execution is a popular technique for reducing software build length, and for good reason. These days, multi-core computers have become standard — even my laptop has four cores — so there’s horsepower to spare. And it’s “falling over easy” to implement: just slap a “-j” onto your make command-line, sit back and enjoy the benefits of a build that’s 2, 3 or 4 times faster than it used to be. Sounds great!
But then, inevitably, invariably, you run into parallel build problems: incomplete dependencies in your makefiles, tools that don’t adequately uniquify their temp file names, and any of a host of other things that introduce race conditions into your parallel build. Sometimes everything works great, and you get a nice, fast, correct build. Other times, your build blows up in spectacular fashion. And then there are the builds that appear to succeed, but in fact generate bogus outputs, because some command ran too early and used files generated in a previous instead of the current build.
This is precisely the problem ElectricMake was created to solve — it gives you fast, reliable parallel builds, regardless of how (im)perfect your makefiles and tools are. If the build works serially, it will work with ElectricMake, but faster. If you’ve worked with parallel builds for any length of time, you can probably appreciate the benefit of that guarantee.
But maybe you haven’t had much experience with parallel builds yourself, or maybe you have but like many people, you don’t believe this problem can actually be solved. In that case, perhaps some data will persuade you. Here’s a sample of open source projects that don’t build reliably in parallel using gmake:
For each, I did several trials with gmake at various levels of parallelism, to determine how frequently the parallel build fails. Then, I did the same build several times with emake and again measured the success rate. Here you can see the classic problem of parallel builds with gmake — works great at low levels of parallelism (or serially, the “degenerate” case of parallel!), but as you ratchet up the parallelism, the build gets less and less reliable. At the same time, you can see that emake is rock solid regardless of how much parallelism you use:
The prize for this reliability? Faster builds, because you can safely exploit more parallelism. Where gmake becomes unreliable with -j 3 or -j 4, emake is reliable with any number of parallel jobs.
How ElectricMake guarantees reliable parallel builds
The technology that enables emake to ensure reliable parallel builds is called conflict detection. Although there are many nuances to its implementation, the concept is simple. First, track every modification to every file accessed by the build as a distinct version of the file. Then, for each job run during the build, track the files used and verify that the job accessed the same versions it would have had the build run serially. Any mismatch is considered a conflict. The offending job is discarded along with any filesystem modifications it made, and the job is rerun to obtain the correct result.
The versioned file system
At the heart of the conflict detection system is a data structure known as the versioned file system, in which emake records every version of every file used over the lifetime of the build. A version is added to the data structure every time a file is modified, whether that be a change to the content of the file, a change in the attributes (like ownership or access permissions), or the deletion of the file. In addition to recording file state, a version records the job which created it. For example, here’s what the version chain looks like for a file “foo” which initially does not exist, then is created by job A with contents “abc”, deleted by job C, and recreated by job E with contents “123”:
Jobs are the basic unit of work in emake. A job represents all the commands that must be run in order to build a single makefile target. In addition, every job has a serial order — the order in which the job would have run, had the build been run serially. The serial order of a job is dictated by the dependencies and structure of the makefiles that make up the build. Note that for a given build, the serial order is deterministic and unambiguous — even if the dependencies are incomplete, there is exactly one order for the jobs when the build is run serially.
With the serial order for every job in hand, deciding which file version should be used by a given job is simple: just find the version created by the job with the greatest serial order that precedes the job accessing the file. For example, using the version chain above (and assuming that the jobs’ names reflect their serial order), job B should use the version created by job A, while job D should see the file as non-existent, thanks to the version created by job C.
A job enters the completed state once all of its commands have been executed. At that point, any filesystem updates created by the job are integrated into the versioned filesystem, but, critically, they are not pushed to the real filesystem — that gives emake the ability to discard the updates if the job is later found to have conflicts.
Each job runs against a virtual filesystem called the Electric File System (EFS), rather than the real filesystem. The EFS serves several important functions: first, it is the means by which emake tracks file accesses. Second, it enables commands in the build to access file versions that exist in the versioned filesystem, but not yet on the real filesystem. Finally, it isolates simultaneously running jobs from one another, eliminating the possibility of crosstalk between commands.
With all the data emake collects — every version of every file, and the relationship between every job — the actual conflict check is simple: for each file accessed by a job, compare the actual version to the serial version. The actual version is the version that was actually used when the job ran; the serial version is the version that would have been used, if the build had been run serially. For example, consider a job B which attempts to access a file foo. At the time that B runs, the version chain for foo looks like this:
Given that state, B will use the initial version of foo — there is no other option. The initial version is therefore the actual version used by job B. Later, job A creates a new version of foo:
Since job A precedes job B in serial order, the version created by job A is the correct serial version for job B. Therefore, job B has a conflict.
If a job is determined to be free of conflicts, the job is committed, meaning any filesystem updates are at last applied to the real filesystem. Any job that has a conflict is reverted — all versions created by the job are marked invalid, so subsequent jobs will not use them. The conflict job is then rerun in order to generate the correct result. The rerun job is committed immediately upon completion.
Conflict checks are carried out by a dedicated thread which inspects each job in strict serial order. That guarantees that a job is not checked for conflicts until after every job that precedes it in serial order has been successfully verified free of conflicts — without this guarantee, we can’t be sure that we know the correct serial version for files accessed by the job. Similarly, this ensures that the rerun job, if any, will use the correct serial versions for all files — so the rerun job is sure to be conflict free.
ElectricMake: reliable parallel builds
Conceptually, conflict detection is simple — keep track of every version of every file used in a build, then verify that each job used the correct version — but there are many details to its implementation. And in this article I’ve only covered the most basic implementation of conflict detection — after many years of experience and thousands of real-world builds we’ve tweaked the implementation, relaxing the simple definition of a conflict in specific cases in order to improve performance.
The benefit of conflict detection is simple too: reliable parallel builds, which in turn means shorter build times, regardless of how imperfect your makefiles are and how parallel-unsafe your toolchain may be.