Cloud computing has been all the rage lately. But most of the attention has focused on deployment of applications in the cloud, or at best, development of applications for the cloud. I haven’t seen much discussion about the ways that cloud computing can support development of “traditional” software — all that stuff that is not destined for cloud deployment.
Over the past two years, my engineering team has gradually migrated from a large collection of physical servers to a private development cloud. The move has let us support a rapidly growing matrix of platforms, and it has improved both development efficiency and developer happiness. I thought I’d share our experiences.
The bad old days
Two years ago, my development team had a server room stuffed full of rack-mounted computers — literally hundreds of 1U systems. At one point we determined that we had about 40 computers per developer. Seems outrageous, right? But we develop cluster-based software, so for a full system test (involving all major components) a developer needs at least three machines, and often 10 or more. And that was just for one developer, working on one platform. Consider that we have ten development and QA engineers, and that we support over 20 platforms (different flavors/versions of Windows and Linux), and you can see how quickly it adds up, even accounting for systems set up to dual- (or triple-, or quadruple-) boot.
This arrangement was functional, but just barely. The server closet was a nightmare of network, power and KVM cables. We had to retrofit it twice, once to bring in more power, and again to bring in more cooling. Maintaining the systems was a full-time job and then some: keeping everything up-to-date on patches, replacing dead or too-small disk drives, protecting against viruses. And just imagine the nightmare when a new OS came out. Say you start with a cluster of machines configured to dual-boot XP and Server 2003, and then you want to add Server 2008 to the mix. First you have to repartition the drive, assuming it’s even big enough to accommodate all three. Then you have to reinstall the original two OSes, and finally you can install the new OS. Multiply that by the number of machines and you’re looking at days or weeks of effort. Even if you use something like Ghost, you inevitably have a hodgepodge of hardware configurations, so you need to make multiple images.
And even with all the systems we had, we never seemed to have enough. Or rather, never enough of the right kind — when I needed to test on Windows, we only had Linux hosts available, or when I needed a multi-core system (which made up only a fraction of our total), I found they were all in use by my coworkers. Ironically, though we had hundreds of systems, most sat idle much of the time.
We had reached the point of crisis: we couldn’t squeeze any more systems into our server closet, nor any more operating systems on the systems we had. Something had to change.
So we got rid of all our servers.
Creating a private development cloud
Well, OK, not all of them. Actually, we replaced our cornucopia of cheap computers with a couple dozen beefy servers. Thanks to advances in hardware, we were able to get inexpensive 2U systems with 8 cores (dual quad-core CPUs), a boatload of memory and large, fast disks. Then we put VMware’s ESX Server on them, and started using virtual machines instead of physical ones for the bulk of our development and testing needs. We didn’t realize it at the time, but we had created a private development cloud.
This approach has a lot of advantages over physical systems, which will be familiar to anybody who’s followed the cloud computing trend:
- Increased utilization: each VM server hosts 10-12 virtual machines. Although many VMs are idle at any given time, others are not, so there is at least some load on each of the physical cores that we do have.
- Greater flexibility: each “slot” in our virtual infrastructure can host a VM of whatever flavor we need. It doesn’t matter if there are 10, 20 or 100 Linux VMs already deployed by other developers: if I need a Linux system, I can get it.
- Elasticity: I can grow and shrink my virtual cluster as needed, at the touch of a button. I no longer need to haggle with my coworkers for resources, or wait patiently for somebody to finish their tests.
- Self-service and ease of use: adding a new test system in our old infrastructure was a major chore. We had to requisition hardware, then get IT involved to find a place to rack it, plug it in and install the OS or OSes. Best case scenario: days from the time I determine I need a new system to the time I can use it. With our private cloud, it’s literally as easy as visiting a web page, choosing the OS, the number of cores and the amount of RAM, and clicking a button. Ten minutes later I’m ready for business.
- Reduced IT costs: instead of managing hundreds of computers, our IT department only maintains about 20 VM servers (which are all identical), and about 20 VM “templates” from which we create any number of VM instances. If a VM goes bad for any reason, we just discard it and regenerate it from the template; nobody wastes time trying to “fix” a broken VM. Adding support for a new OS is dramatically easier: set up a single VM template with the new OS and publish it for use.
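For a rough sense of scale, the figures in the list above can be combined into a back-of-the-envelope capacity estimate. This is only a sketch: it assumes every server is identical and fully populated, and the function names are made up for illustration.

```python
def vm_capacity(servers: int, vms_per_server: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) number of VMs the whole cloud can host."""
    low, high = vms_per_server
    return servers * low, servers * high


def vms_per_core(vms_per_server: tuple[int, int], cores_per_server: int) -> tuple[float, float]:
    """Return the (low, high) number of VMs sharing each physical core."""
    low, high = vms_per_server
    return low / cores_per_server, high / cores_per_server


# Figures quoted in the post: about 20 servers, 8 cores each, 10-12 VMs per server.
slots = vm_capacity(servers=20, vms_per_server=(10, 12))
density = vms_per_core(vms_per_server=(10, 12), cores_per_server=8)

print(f"VM slots: {slots[0]}-{slots[1]}")
print(f"VMs per physical core: {density[0]:.2f}-{density[1]:.2f}")
```

Under those assumptions the cloud offers somewhere between 200 and 240 VM slots, with fewer than two VMs contending for each physical core.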
Although things are working pretty well now, we had our share of difficulties in the transition. We didn’t have anybody in house with any particular experience with ESX Server, so there was a learning curve for that. One particular problem we had was figuring out how much disk space to allocate to each ESX server — we foolishly tried to lowball that axis, and we paid for that mistake with VM server downtime (and thus reduced cloud capacity) each time we realized we still had not allocated enough space. In short: get as much disk as you possibly can.
Another lesson learned was to avoid using the “Undeploy and save state” feature of ESX Server. That’s conceptually similar to suspending a system rather than powering it down, and it chews up storage space on the ESX Server, often for no good reason. We also learned to avoid making clones of templates when deploying VM instances, again because it chews up storage space.
We also found that putting “too many” VM templates on a single disk partition caused significant filesystem lock contention, so we had to do some trial-and-error experimentation to find the “magic number” of templates per partition (it’s about 10, by the way).
Finally, we’ve found that although virtual machines fill the majority of our needs, we still need some physical machines, particularly for performance testing. Virtual machines are terrible for performance testing. First, it’s difficult to control the entire environment while running tests: the so-called noisy neighbors problem. Second, performance analysis, already an often arduous task, becomes nearly impossible with the extra complexity that virtualization introduces: not only do you need to be mindful of what’s happening on your VM, you must also be aware of what’s happening on the VM server, and possibly what’s happening on other VMs hosted on the same server.
Cloud computing for traditional dev/test
It was a bit of a rocky road to get where we are today, but I can say with confidence now that we absolutely made the right decision. Cloud computing is not just for scaling massive web applications: it is just as useful in traditional software development and test environments.
I wish I could quantify the positive impact on our development with data on improved quality or efficiency or reduced development time. I can’t. But I can make some concrete statements about the benefits we’ve enjoyed:
- Most importantly, without our private cloud we would have been unable to grow our support matrix to the 20+ platforms it includes today.
- Second, we reduced our IT cost by more than 5x, by reducing the number of systems that IT manages from (at least) 115 to just 21.
- Finally, we cut our electrical bill by more than 4x, from about $50,000 per year (115 physical servers with 300 watt power supplies, running 24/7, at a cost of $0.17 per kWh) to just $11,000 per year (21 VM servers with 345 watt power supplies). Likewise, we reduced our cooling costs from about $25,000 per year to about $6,000 per year.
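The electricity figures can be reproduced with a few lines of arithmetic. The sketch below assumes each server continuously draws the full rating of its power supply, which is a simplification (real draw is usually lower and varies with load); the function name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # servers run 24/7


def annual_power_cost(servers: int, watts_each: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost, assuming every server draws its full rated wattage."""
    kwh_per_year = servers * watts_each / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


# Figures quoted in the post.
before = annual_power_cost(servers=115, watts_each=300, usd_per_kwh=0.17)
after = annual_power_cost(servers=21, watts_each=345, usd_per_kwh=0.17)

print(f"before: ${before:,.0f} per year")
print(f"after:  ${after:,.0f} per year")
print(f"ratio:  {before / after:.1f}x")
```

With these inputs the calculation gives roughly $51,400 per year before and $10,800 per year after, a ratio of about 4.8x, which is in line with the rounded figures quoted above.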
Beyond that, all I have is anecdotal evidence and the assurances of my teammates that “things are way better now.” For my part, the fact that I no longer have to arm wrestle my coworkers for access to resources makes it all worthwhile.