Make syntax is the worst… except for all the alternatives

My series of comparisons between SCons and GNU make sparked a lot of discussion, not just about SCons and gmake, but about many other build tools. That was to be expected, but what surprised me was several comments specifically criticizing the syntax of make — the semicolons, colons, at-signs and dollar-signs that we all know so well. One reader actually said that make syntax has a 1970’s feel, as if the age of the language were somehow an indicator of its unsuitability for the task. Then my friend John Graham-Cumming posted an article in defense of make syntax, and I figured I would add my thoughts.

Criticisms of make syntax strike me as a bit absurd. Take a look around the build tool space: you’ll see that many of these “improved” tools use syntax that ranges from “pretty much the same” to “ridiculously verbose”. Let’s look at the syntax used for the two core functions of a build system: specifying the graph of dependencies between files, and specifying the commands to generate a file from a set of inputs.

Dependency graph syntax

The syntax for describing the relationship between input and output files in make is concise, if oblique:

foo: foo.in

To me this is elegant in its simplicity. You may argue that the choice of a colon is arbitrary, and you’d be right — but then, what would be significantly better? I would say there is nothing that is better, but plenty of things worse. For comparison, look at the same relationship, expressed in the syntax of some other build tools:

CMake
add_custom_command(
    OUTPUT foo
    COMMAND update -o foo foo.in
    DEPENDS foo.in)
Cook
foo: foo.in ;
Jam
MyCompile foo : foo.in ;
Rake
file foo => [ 'foo.in' ]
SCons
env.MyCompile('foo', 'foo.in')
tup
foo.in |> update -o %o %f |> foo
Waf
bld(
    rule     = 'update -o ${TGT} ${SRC}',
    source   = 'foo.in',
    target   = 'foo')

Some of these, like Cook and Jam, are nearly identical to make. Others, like Waf, are certainly more verbose, but not obviously better. That verbosity may not seem like a big deal when there’s only a handful of targets, but with hundreds of targets it becomes an irritation.

The truth is that there just isn’t any particular syntax that naturally lends itself to expressing a dependency graph. The reason make syntax hasn’t changed in over 30 years is because target: prereq works, and it’s just as good as anything else you might choose.

Command syntax

True to form, the syntax for specifying the commands to run to generate a file in make is just as terse:

update -o $@ $^

This minimalist syntax naturally puts the emphasis on the important stuff: the command to run and its flags. Here’s the same command in some other build tools (nota bene: some of these are the same as what’s shown above; in those cases I could not easily determine a syntax for specifying dependencies separately from commands, or whether that is even possible with that tool):

CMake
add_custom_command (
  OUTPUT foo
  COMMAND update -o foo foo.in
  DEPENDS foo.in
)
Cook
{
     update -o [target] [need];
}
Jam
rule MyRule
{
    MyCompile $(1) $(2) ;
}
actions MyCompile
{
    update -o $(1) $(2)
}
Rake
sh "update -o #{t.name} #{t.prerequisites.join(' ')}"
SCons
env.Append(BUILDERS =
  {'MyCompile': Builder(action = 'update -o $TARGET $SOURCE', 
    src_suffix='.in')})
tup
foo.in |> update -o %o %f |> foo
Waf
bld(
    rule     = 'update -o ${TGT} ${SRC}',
    source   = 'foo.in',
    target   = 'foo')

Most of these are more verbose than make, and for me the extra text just makes it harder to see what’s really going on. The SCons example is particularly ugly: 6 times the characters to express the same simple command!
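
For reference, here’s what a complete rule looks like when the two pieces are put together: the dependency line on top, and the command below it, indented with a literal TAB (a minimal sketch, using the same hypothetical update command as the examples above):

foo: foo.in
	update -o $@ $^

When make runs the command, $@ expands to the target (foo) and $^ expands to the full list of prerequisites (foo.in), so nothing needs to be spelled out twice.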

Did you mean TAB instead of 8 spaces?

I suspect that at the heart of complaints about make syntax is a single unfortunate confluence of facts. First, make uses a literal TAB character to mark the beginning of a command in a recipe. Second, most code editors automatically replace TAB with spaces. Together these facts conspire to confound even the most experienced makefile writer, resulting in this slightly condescending, always irritating error message:

*** missing separator (did you mean TAB instead of 8 spaces?)

I won’t argue with you, this is a real nuisance. But there’s good news: GNU make 3.82 introduced a new special variable called .RECIPEPREFIX. Set this variable to any character you like, and GNU make will use that instead of TAB to mark commands in the makefile. For example:

.RECIPEPREFIX=!
all:
!@echo Who says make syntax is bad?
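
And because .RECIPEPREFIX is an ordinary variable, you can change it partway through a makefile; as I read the GNU make manual, the prefix in effect when a recipe line is read is the one that applies. A quick sketch:

# Each rule uses whatever .RECIPEPREFIX was when its recipe lines were read.
.RECIPEPREFIX = !
all:
!@echo recipes here start with '!'

.RECIPEPREFIX = >
clean:
>@rm -f foo

(The caveat, of course, is that this requires GNU make 3.82 or later; older versions will still insist on the TAB.)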

Conclusion

Don’t get me wrong: as with any tool, there is room for improvement in make. I agree with John’s suggestion to optionally include command-lines and input-file checksums in the up-to-date decisions (some of that is available now in ElectricAccelerator). Beyond that, I think it would be great to add support for non-pattern rules with multiple outputs — there’s no way to express that directly today, although there are a variety of hacks to emulate it. The interesting thing about these ideas is that all of them could be added to make without requiring the creation of a completely new build tool.
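
To make the multiple-outputs problem concrete: an explicit rule like parser.c parser.h: parser.y (to pick a hypothetical example) does not mean “one command builds both files”. Make treats it as two independent rules that happen to share a recipe, so it doesn’t know that a single invocation produces both outputs, and under -j it can even run the recipe twice. The most common hack exploits the fact that GNU make does treat a pattern rule with multiple targets as a single recipe invocation that produces all of them (a sketch, using the classic parser-generator case):

# One recipe, two targets: make runs bison once and understands that the
# single invocation produces both the .c and the .h file.
%.tab.c %.tab.h: %.y
	bison -d $<

It works, but only when the outputs happen to share a common stem, which is exactly why first-class support for explicit rules with multiple outputs would be welcome.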

Yes, make syntax is terse, but the lack of extraneous noise makes it easier to see what’s going on in a makefile than in a comparable build file from another tool. Likewise, make syntax is old, but rather than being a weakness, I see that as a testament to its fitness. Surely it’s telling that in 30 years, nothing else has come along that is obviously better, or sufficiently better to justify the cost of migration.

Public versus private clouds for dev/test

I recently wrote about our experience migrating to cloud computing to support development and QA activities. Our cloud enables us to support more platforms, at lower cost, and with less complexity than the fleet of physical servers it replaced. But I didn’t have room to talk about one important decision in our migration: whether to build a private cloud or use an existing public cloud like Amazon EC2 or Rackspace Cloud.

For us, the decision was easy. The public cloud is unsuitable for three reasons: platforms, bandwidth and money. First, the public cloud doesn’t support the platforms we need for testing. Second, uploading data to the public cloud takes way too long by today’s agile, continuous development standards. Finally, and probably most interesting to you, the public cloud is surprisingly expensive. In fact, I estimate that the public cloud would cost us more than twice as much as our private cloud, every year.

Public clouds don’t support all of our platforms

My product is supported on a smörgåsbord of x86-based platforms — various incarnations of Windows, from XP to Vista to Windows 7, and a variety of Linux distributions from RHEL4 to Ubuntu 10. Our quality standards demand that we run the platform-dependent portion of our test suite on every supported platform. Pretty standard stuff, I imagine. Too bad for us, then, that you can’t run XP, Vista or 7 in the public cloud.

Bandwidth to the public cloud stinks

My company is connected to the Internet via a puny 10 Mbit EoC pipe. In comparison, our internal network uses fat GigE connections. Under ideal conditions, it takes 100x longer to transfer data to the public cloud than to our private cloud. Think about that for a second. Heck, think about it for 600 seconds: that’s how long it would take me to upload 750 MB, the total size of our install packages (750 MB is roughly 6,000 megabits; at 10 Mbit/s, that works out to 600 seconds). And that’s the best case. When’s the last time you hit the advertised upload speed on your Internet connection?

Transferring those files on our intranet requires a barely measurable 6 seconds:

[Chart: Time (s) to transfer 750 MB]

Adding that kind of delay to our CI builds is just not acceptable.

The public cloud is expensive

Many people assume that the public cloud will be cheaper than a private cloud. A day’s worth of compute time on Amazon EC2 costs less than a Starbucks latte, and you have no upfront cost, unlike a private cloud which has substantial upfront capital expenses. But it pays to run the numbers. In our case, the public cloud is more than twice the cost of a private cloud:

[Chart: Public vs. private cloud cost comparison]

I split the costs into two buckets, because we have two fundamentally different usage models for the VM’s in our cloud. First are the systems used by our continuous integration server to run automated tests. Each CI build uses 12 Linux and 8 Windows systems, one for each supported platform. Our testing standards require that those systems are dual-core, but the work load is light since they just run unit tests and simple system tests. We have three such blocks of 20 systems, so we can run three CI builds simultaneously. Because the CI server never sleeps, these systems are always on.

Second are the systems used day-to-day by developers for testing and debugging. Each developer may use just a few systems, or more than a dozen depending on their needs. It’s hard to pin down the precise duty cycle, but eyeballing data from our cloud servers I estimate we have about 80 systems in use per day, for about 8 hours each. They are split roughly 50/50 between Linux and Windows. Two-thirds of the systems are single-core, and the rest are at least dual-core.

Pricing the public cloud

Once you know the type and quantity of VM’s you need, and for how long, it’s straightforward to compute the cost of the public cloud. Because I’m most familiar with Amazon EC2, I’ll use their pricing model. For our CI systems, we would use a mix of Medium and Large instances to match our requirements for multi-core and 64-bit support. Because they are always-on, we’d opt to use the Reserved instance pricing, which offers a lower hourly cost in exchange for a fixed up-front reservation fee.

For developer systems, we would use On-Demand instances, with a mix of Small and Large instances:

Continuous integration systems
  • Medium instances
      • Annual fee = $15,015.00 (33 systems at $455 per system)
      • Linux usage fee = $14,716.80 (21 systems, 24 hours, 365 days, $0.08 per hour per system)
      • Windows usage fee = $15,242.40 (12 systems, 24 hours, 365 days, $0.145 per hour per system)
  • Large instances
      • Annual fee = $24,570.00 (27 systems at $910 per system)
      • Linux usage fee = $21,024.00 (15 systems, 24 hours, 365 days, $0.16 per hour per system)
      • Windows usage fee = $25,228.80 (12 systems, 24 hours, 365 days, $0.24 per hour per system)
  • Subtotal = $115,797.00

Development systems
  • Small instances
      • Linux = $4,940.00 (26 systems, 8 hours, 250 days, $0.095 per hour per system)
      • Windows = $6,760.00 (26 systems, 8 hours, 250 days, $0.13 per hour per system)
  • Large instances
      • Linux = $10,640.00 (14 systems, 8 hours, 250 days, $0.38 per hour per system)
      • Windows = $15,600.00 (15 systems, 8 hours, 250 days, $0.52 per hour per system)
  • Subtotal = $37,940.00

Total = $153,737.00

Pricing the private cloud

It’s somewhat harder to compute the cost of a private cloud, because there is a greater variety of line-item costs, and they cannot all be easily calculated. The most obvious cost is that of the hardware itself. We use dual quad-core servers which cost about $3,000 each. Six of these servers host our CI VM’s. Note that this is only 48 physical cores, but our CI VM’s use a total of 120 virtual cores. This is called oversubscription, and it works because the load on the virtual cores is light — if each virtual core is active only 30-50% of the time, then one physical core can support 2-3 virtual cores.

We use 15 servers for our on-demand development VM’s. Unlike the CI systems, these VM’s are subject to heavy load, so we cannot oversubscribe the hardware to the same degree.

The next obvious cost is the electricity to power our servers, and of course the A/C costs to keep everything cool. Our electrical rate is about $0.17 per KWh, and we estimate the cooling cost at about 50% of the electrical cost.

Finally, we must consider the cost to maintain our 21 VM servers. To compute that amount, we must first determine how much of a sysadmin’s time will be spent managing these servers. Data from multiple sources shows that a sysadmin can maintain at least 100 servers, particularly if they are homogeneous as these are. Our servers therefore consume at most 21% of a sysadmin’s time.

Next, we have to determine the cost of the sysadmin’s time. I’m not privy to the actual numbers, but salary.com tells me that a top sysadmin in our area has a salary of about $90,000. The fully loaded cost of an employee is usually estimated at 2x the salary, for a total cost of $180,000 per year.

Here’s how it all adds up:

Continuous integration systems
  • Hardware = $6,000.00 (6 dual quad-core systems at $3,000 each, amortized over 3 years)
  • Personnel = $10,800.00 (6% of a fully-loaded sysadmin at $180,000)
  • Electricity = $3,082.65 (6 systems x 345 W x 24 hours x 365 days x $0.17 per KWh)
  • Cooling = $1,541.33 (50% of electricity cost)
  • Subtotal = $21,423.98

Development systems
  • Hardware = $15,000.00 (15 dual quad-core systems at $3,000 each, amortized over 3 years)
  • Personnel = $27,000.00 (15% of a fully-loaded sysadmin at $180,000)
  • Electricity = $7,706.61 (15 systems x 345 W x 24 hours x 365 days x $0.17 per KWh)
  • Cooling = $3,853.31 (50% of electricity cost)
  • Subtotal = $53,559.92

Total = $74,983.90

Why is the public cloud so expensive?

I wasn’t surprised that the public cloud was more expensive, but I was surprised that it was that much more expensive. I had to figure out why it was so, and I think it comes down to two factors. First, we need 64-bit dual-core VM’s for our tests, but 64-bit support is only available on Large or better instances, which are at least 2x the cost of Medium instances. We would be forced to pay for more (virtual) hardware than we need.

Second, we benefit significantly by oversubscribing the hardware in our private cloud with 2.5 virtual cores per physical core. I have no doubt that Amazon is doing the same thing behind the scenes, but — and this is the real kicker — virtual cores in the public cloud are priced assuming a one-to-one virtual-to-physical ratio. Put another way, even though the public cloud provider is certainly oversubscribing their hardware and you’re only getting a fraction of a physical core for each virtual core, you still have to pay full price for those virtual cores. For all that increased hardware utilization is touted as a benefit of cloud computing, it only applies if you own the hardware.

Does it ever make sense to use the public cloud?

The results here are pretty dismal, but I think there are situations where the public cloud is the best choice. First, although private is cheaper in the long term, it requires a substantial upfront investment just to get off the ground — $63,000 for the hardware in our case. You may not have that kind of capital to work with.

Second, if your needs truly are “bursty”, the public cloud on-demand pricing is actually pretty competitive. Of course, you have to be really good about managing those VM’s — if you leave them powered on but idle, you still pay usage fees, which will quickly inflate your expenses.

Finally, if you’re just “testing the waters” to see if cloud computing will work for you, it’s definitely cheaper and easier to do that with a public cloud.

Private clouds for dev/test

Our private cloud has been a powerful enabling technology for my team. If you’re in a similar situation, you should seriously consider private versus public. You might be surprised to see how favorably the private cloud compares.

Cloud computing for traditional dev/test

Cloud computing has been all the rage lately. But most of the attention has focused on deployment of applications in the cloud, or at best, development of applications for the cloud. I haven’t seen much discussion about the ways that cloud computing can support development of “traditional” software — all that stuff that is not destined for cloud deployment.

Over the past two years, my engineering team has gradually migrated from a large collection of physical servers to a private development cloud, which has enabled us to support a rapidly increasing matrix of platforms and also improved development efficiency and developer happiness. I thought I’d share our experiences.

The bad old days

Two years ago, my development team had a server room stuffed full of rack-mounted computers — literally hundreds of 1U systems. At one point we determined that we had about 40 computers per developer. Seems outrageous, right? But we develop cluster-based software, so for a full system test (involving all major components) a developer needs at least three machines, and often 10 or more. And that was just for one developer, working on one platform. Consider that we have ten development and QA engineers, and that we support over 20 platforms (different flavors/versions of Windows and Linux), and you can see how quickly it adds up, even accounting for systems set up to dual- (or triple-, or quadruple-) boot.

This arrangement was functional, but just barely. The server closet was a nightmare of network, power and KVM cables. We had to retrofit it twice, once to bring in more power, and again to bring in more cooling. Maintaining the systems was a full-time job and then some: keeping everything up-to-date on patches, replacing dead or too-small disk drives, protecting against viruses. And just imagine the nightmare when a new OS came out — start with a cluster of machines configured to dual-boot XP and Server 2003, and then you want to add Server 2008 to the mix. First you have to repartition the drive, assuming it’s even big enough to accommodate all three. Then you have to reinstall the original two OS’s, and finally you can install the new OS. Multiply that by the number of machines and you’re looking at days or weeks of effort. Even if you use something like Ghost, inevitably you have a hodgepodge of hardware configurations, so you need to make multiple images.

And even with all the systems we had, we never seemed to have enough. Or rather, never enough of the right kind — when I needed to test on Windows, we only had Linux hosts available, or when I needed a multi-core system (which made up only a fraction of our total), I found they were all in use by my coworkers. Ironically, though we had hundreds of systems, most sat idle much of the time.

We had reached the point of crisis: we couldn’t squeeze any more systems into our server closet, nor any more operating systems on the systems we had. Something had to change.

So we got rid of all our servers.

Creating a private development cloud

Well, OK, not all of them. Actually, we replaced our cornucopia of cheap computers with a couple dozen beefy servers — thanks to advances in hardware, we were able to get inexpensive 2U systems with 8 cores (dual quad-core), a boatload of memory and large, fast disks. Then we put VMware’s ESX Server on them, and started using virtual machines instead of physical ones for the bulk of our development and testing needs. We didn’t realize it at the time, but we had created a private development cloud.

This approach has a lot of advantages over physical systems, which will be familiar to anybody who’s followed the cloud computing trend:

  • Increased utilization: each VM server hosts 10-12 virtual machines; although many VM’s are idle at any given time, others are not, ensuring that there is at least some load on each of the physical cores that we do have.
  • Greater flexibility: each “slot” in our virtual infrastructure can host a VM of whatever flavor we need. It doesn’t matter if there are 10, 20 or 100 Linux VM’s already deployed by other developers: if I need a Linux system, I can get it.
  • Elasticity: I can grow and shrink my virtual cluster as needed, at the touch of a button. I no longer need to haggle with my coworkers for resources, or wait patiently for somebody to finish their tests.
  • Self-service and ease of use: adding a new test system in our old infrastructure was a major chore: requisition hardware, get IT involved to find a place to rack it, plug it in, and install the OS or OSes. Best-case scenario: days from the time I determine I need a new system to the time I can use it. With our private cloud, it’s literally as easy as visiting a web page, choosing the OS, the number of cores and the amount of RAM, and clicking a button. Ten minutes later I’m ready for business.
  • Reduced IT costs: instead of managing hundreds of computers, our IT department only maintains about 20 VM servers (which are all identical), and about 20 VM “templates” from which we create any number of VM instances. If a VM goes bad for any reason, we just discard it and regenerate it from the template — nobody wastes time trying to “fix” a broken VM. Adding support for a new OS is dramatically easier too: set up a single VM template with the new OS and publish it for use.

Lessons learned

Although things are working pretty well now, we had our share of difficulties in the transition. We didn’t have anybody in-house with any particular ESX Server experience, so there was a learning curve. One particular problem was figuring out how much disk space to allocate to each ESX server — we foolishly tried to lowball that axis, and we paid for that mistake with VM server downtime (and thus reduced cloud capacity) each time we realized we still had not allocated enough space. In short: get as much disk as you possibly can.

Another lesson learned was to avoid using the “Undeploy and save state” feature of ESX Server. That’s conceptually similar to suspending a system, versus powering it down, and it chews up storage space on the ESX Server, often for no good reason. And, we learned to avoid making clones of templates when deploying VM instances, again because it chews up storage space.

We also found that putting “too many” VM templates on a single disk partition caused significant filesystem lock contention, so we had to do some trial-and-error experimentation to find the “magic number” of templates per partition (it’s about 10, by the way).

Finally, we’ve found that although virtual machines fill the majority of our needs, we still need some physical machines, particularly for performance testing. Virtual machines are terrible for performance testing, first because it’s difficult to control the entire environment while running tests — the so-called noisy-neighbors problem. Second, performance analysis, already an arduous task, becomes nearly impossible with the extra complexity introduced by virtualization: not only do you need to be mindful of what’s happening on your VM, you must also be aware of what’s happening on the VM server, and possibly on other VM’s hosted on the same server.

Cloud computing for traditional dev/test

It was a bit of a rocky road to get where we are today, but I can say with confidence now that we absolutely made the right decision. Cloud computing is not just for scaling massive web applications: it is just as useful in traditional software development and test environments.

I wish I could quantify the positive impact on our development with data on improved quality or efficiency or reduced development time. I can’t. But I can make some concrete statements about the benefits we’ve enjoyed:

  • Most importantly, without our private cloud we would have been unable to grow our support matrix to the 20+ platforms it includes today.
  • Second, we reduced our IT cost by at least 6x, by reducing the number of systems that IT manages from (at least) 115 to just 21.
  • Finally, we cut our electrical bill by at least 5x from $50,000 per year (115 physical servers, 300 watt power supplies, running 24/7, at a cost of $0.17 per KWh), to just $11,000 per year (21 VM servers, 345 watt power supplies). Likewise, we reduced our cooling costs from about $25,000 per year to about $6,000 per year.

Beyond that, all I have is anecdotal evidence and the assurances of my teammates that “things are way better now.” For my part, the fact that I no longer have to arm wrestle my coworkers for access to resources makes it all worthwhile.