One idea that comes up now and then is to speed up parallel builds by being smarter about the order used to run the jobs in the build. Obviously we don’t have complete control over the order — we have to respect the dependencies, of course — but at any given point there are probably multiple jobs ready-to-run. All things being equal, the build ought to finish sooner if we start the longest jobs first. But does it really work out that way?
A compelling example
Here’s an example that a user posted on the GNU make mailing list:
Here’s the dependency graph for this simple makefile. The numbers in parenthesis indicate the serial order of the jobs in the build — the order the jobs will execute if the build runs serially:
The serial order is also the order that make will start the jobs when running in parallel. For example, if we run this makefile with gmake -j 2, we would see the execution proceed as follows:
- 0 minutes: gmake start jobs A and C.
- 1 minute: C completes and gmake starts D (two minutes left on job A).
- 2 minutes: D completes and gmake starts B (one minute left on job A).
- 3 minutes: A and B complete, and gmake starts E.
- 9 minutes: E completes, and the build ends.
Visually, it looks like this:
But this ordering is obviously quite inefficient. Job E is not dependent on any other jobs, so there’s no reason we can’t start it sooner. In fact, if we start E first, then the execution looks like this:
By starting the longest jobs first, we can trim the overall build time by an impressive 30%! So: obviously we can fabricate a build that shows significant improvement by use of a longest-job-first scheduler. But will we see similar results from real builds?
Look before you leap
To answer this question, I made a build simulator that simulates running a build using longest-job-first scheduling. The simulator uses job duration and dependency information from an annotation file generated during a real build with ElectricAccelerator, a high-performance gmake replacement.
I tested the longest-job-first scheduler on several builds:
To my disappointment, the new scheduler showed no significant benefit on these builds:
Build times in real builds are virtually unchanged with the longest-job-first scheduler. On some of these graphs you can barely even tell there are two distinct lines!
What went wrong?
I think there are two factors that explain this lackluster result. The first is homogeneity: in most builds, the majority of the jobs are more-or-less the same length. For example, 90% of the jobs in the Mozilla build are less than 0.25s long. 80% of the jobs in the Samba build are in the 2.5s to 5.0s range. What difference does it make to choose job A or job B when they both have nearly identical durations? Now, maybe you’re thinking “A-ha! What about the link job? That’s definitely longer than the other jobs!” And it’s true, the link job often is longer. But by its nature it can’t start any sooner than after all the other jobs have finished anyway — so again the longest-job-first scheduler has no choice to make, because there is only one job to choose.
The second factor is the longest-job-first scheduler is smarter, but not smart enough: by considering only the length of each job in isolation, the longest-job-first scheduler cannot account for situations where a short job blocks a very long job. For example, suppose that we change our original example by adding a prereq for job E:
Because job F is so short, the longest-job-first scheduler will never prioritize it over jobs A, B, C, or D — in fact, the scheduler won’t run job F until it is literally the only choice left. Unfortunately, that means that job E won’t run until the end of the build, clobbering our overall build time.
Where do we go from here?
From these simulations, it’s clear that there’s little point in pursuing a simple longest-job-first scheduler. But I think there may be something to the idea of a scheduler that considers not just individual job lengths but the relationships between jobs. I’ll explore that possibility in a future post.