Public versus private clouds for dev/test

I recently wrote about our experience migrating to cloud computing to support development and QA activities. Our cloud enables us to support more platforms, at lower cost, and with less complexity than the fleet of physical servers it replaced. But I didn’t have room to talk about one important decision in our migration: whether to build a private cloud or use an existing public cloud like Amazon EC2 or Rackspace Cloud.

For us, the decision was easy. The public cloud is unsuitable for three reasons: platforms, bandwidth and money. First, the public cloud doesn’t support the platforms we need for testing. Second, uploading data to the public cloud takes way too long by today’s agile, continuous development standards. Finally, and probably most interesting to you, the public cloud is surprisingly expensive. In fact, I estimate that the public cloud would cost us more than twice as much as our private cloud, every year.

Public clouds don’t support all of our platforms

My product is supported on a smörgåsbord of x86-based platforms: various incarnations of Windows, from XP to Vista to Windows 7, and a variety of Linux distributions from RHEL4 to Ubuntu 10. Our quality standards demand that we run the platform-dependent portion of our test suite on every supported platform. Pretty standard stuff, I imagine. Too bad for us, then, that you can’t run XP, Vista or 7 in the public cloud.

Bandwidth to the public cloud stinks

My company is connected to the Internet via a puny 10 Mbit EoC pipe. In comparison, our internal network uses fat GigE connections. Under ideal conditions, it takes 100x longer to transfer data to the public cloud than to our private cloud. Think about that for a second. Heck, think about it for 600 seconds: that’s how long it would take me to upload 750 MB, the total size of our install packages. And that’s best case. When’s the last time you hit the advertised upload speed on your Internet connection?

Transferring those files on our intranet requires a barely measurable 6 seconds:

[Chart: Time (s) to transfer 750 MB]

Adding that kind of delay to our CI builds is just not acceptable.
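The best-case numbers above are easy to reproduce. Here is a minimal Python sketch, assuming the links run at their full rated speed and ignoring protocol overhead and latency; the function and constant names are made up for illustration.

```python
# Best-case transfer time over a link, ignoring protocol overhead and latency.
# The sizes and link speeds are the ones quoted in the text.

def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Seconds needed to move size_mb megabytes over a link rated in megabits/s."""
    size_mbit = size_mb * 8  # 1 byte = 8 bits
    return size_mbit / link_mbit_per_s

PACKAGE_MB = 750   # total size of the install packages
WAN_MBIT = 10      # Internet uplink
LAN_MBIT = 1000    # internal GigE network

wan = transfer_seconds(PACKAGE_MB, WAN_MBIT)
lan = transfer_seconds(PACKAGE_MB, LAN_MBIT)
print(f"public cloud: {wan:.0f} s, private cloud: {lan:.0f} s, ratio: {wan / lan:.0f}x")
```

Real uploads will be slower than this, since neither link sustains its advertised rate, but the 100x gap between the two is set by the link speeds alone.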

The public cloud is expensive

Many people assume that the public cloud will be cheaper than a private cloud. A day’s worth of compute time on Amazon EC2 costs less than a Starbucks latte, and you have no upfront cost, unlike a private cloud, which has substantial upfront capital expenses. But it pays to run the numbers. In our case, the public cloud is more than twice the cost of a private cloud:

[Chart: Public vs private cloud cost comparison]

I split the costs into two buckets, because we have two fundamentally different usage models for the VM’s in our cloud. First are the systems used by our continuous integration server to run automated tests. Each CI build uses 12 Linux and 8 Windows systems, one for each supported platform. Our testing standards require that those systems are dual-core, but the workload is light since they just run unit tests and simple system tests. We have three such blocks of 20 systems, so we can run three CI builds simultaneously. Because the CI server never sleeps, these systems are always on.

Second are the systems used day-to-day by developers for testing and debugging. Each developer may use just a few systems, or more than a dozen depending on their needs. It’s hard to pin down the precise duty cycle, but eyeballing data from our cloud servers I estimate we have about 80 systems in use per day, for about 8 hours each. They are split roughly 50/50 between Linux and Windows. Two-thirds of the systems are single-core, and the rest are at least dual-core.

Pricing the public cloud

Once you know the type and quantity of VM’s you need, and for how long, it’s straightforward to compute the cost of the public cloud. Because I’m most familiar with Amazon EC2, I’ll use their pricing model. For our CI systems, we would use a mix of Medium and Large instances to match our requirements for multi-core and 64-bit support. Because they are always-on, we’d opt to use the Reserved instance pricing, which offers a lower hourly cost in exchange for a fixed up-front reservation fee.

For developer systems, we would use On-Demand instances, with a mix of Small and Large instances:

Continuous integration systems
  Medium instances
    Annual fee = $15,015.00 (33 systems at $455 per system)
    Linux usage fee = $14,716.80 (21 systems, 24 hours, 365 days, $0.08 per hour per system)
    Windows usage fee = $15,242.40 (12 systems, 24 hours, 365 days, $0.145 per hour per system)
  Large instances
    Annual fee = $24,570.00 (27 systems at $910 per system)
    Linux usage fee = $21,024.00 (15 systems, 24 hours, 365 days, $0.16 per hour per system)
    Windows usage fee = $25,228.80 (12 systems, 24 hours, 365 days, $0.24 per hour per system)
  Subtotal = $115,797.00
Development systems
  Small instances
    Linux = $4,940.00 (26 systems, 8 hours, 250 days, $0.095 per hour per system)
    Windows = $6,760.00 (26 systems, 8 hours, 250 days, $0.13 per hour per system)
  Large instances
    Linux = $10,640.00 (14 systems, 8 hours, 250 days, $0.38 per hour per system)
    Windows = $14,560.00 (14 systems, 8 hours, 250 days, $0.52 per hour per system)
  Subtotal = $36,900.00
Total = $152,697.00
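For anyone who wants to plug in their own instance counts and rates, the line items above follow two simple formulas. The Python sketch below is only an illustration: the helper names are invented, and the rates are the ones listed above, not current Amazon pricing. Evaluated this way, the Large Windows development line (14 systems, 8 hours, 250 days, $0.52 per hour) comes to $14,560.

```python
HOURS_PER_YEAR = 24 * 365

def reserved_cost(systems, annual_fee, hourly_rate):
    """Yearly cost of always-on Reserved instances: fixed fee plus usage."""
    return systems * (annual_fee + hourly_rate * HOURS_PER_YEAR)

def on_demand_cost(systems, hours_per_day, days_per_year, hourly_rate):
    """Yearly cost of On-Demand instances that are powered on only part of the time."""
    return systems * hours_per_day * days_per_year * hourly_rate

# Continuous integration systems: (systems, annual fee per system, hourly rate)
ci_items = [
    (21, 455, 0.08),    # Medium, Linux
    (12, 455, 0.145),   # Medium, Windows
    (15, 910, 0.16),    # Large, Linux
    (12, 910, 0.24),    # Large, Windows
]

# Development systems: (systems, hourly rate), 8 hours a day, 250 days a year
dev_items = [
    (26, 0.095),        # Small, Linux
    (26, 0.13),         # Small, Windows
    (14, 0.38),         # Large, Linux
    (14, 0.52),         # Large, Windows
]

ci_total = sum(reserved_cost(n, fee, rate) for n, fee, rate in ci_items)
dev_total = sum(on_demand_cost(n, 8, 250, rate) for n, rate in dev_items)

print(f"CI systems:          ${ci_total:,.2f}")
print(f"Development systems: ${dev_total:,.2f}")
print(f"Total:               ${ci_total + dev_total:,.2f}")
```

Changing the hours per day in the development block is a quick way to see how sensitive the On-Demand bill is to VM's that are left running.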

Pricing the private cloud

It’s somewhat harder to compute the cost of a private cloud, because there is a greater variety of line-item costs, and they cannot all be easily calculated. The most obvious cost is that of the hardware itself. We use dual quad-core servers which cost about $3,000 each. Six of these servers host our CI VM’s. Note that this is only 48 physical cores, but our CI VM’s use a total of 120 virtual cores. This is called oversubscription, and it works because the load on the virtual cores is light — if each virtual core is active only 30-50% of the time, then one physical core can support 2-3 virtual cores.
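The oversubscription arithmetic can be sketched in a few lines of Python. This is a simplified model, assuming load is spread evenly and that a virtual core busy a given fraction of the time consumes that same fraction of a physical core; the names are illustrative only.

```python
CI_VMS = 60            # three blocks of 20 always-on CI systems
VCORES_PER_VM = 2      # testing standards call for dual-core VM's
CI_SERVERS = 6
CORES_PER_SERVER = 8   # dual quad-core

virtual_cores = CI_VMS * VCORES_PER_VM           # 120
physical_cores = CI_SERVERS * CORES_PER_SERVER   # 48
ratio = virtual_cores / physical_cores           # virtual cores per physical core

def supportable_vcores(duty_cycle: float) -> float:
    """Virtual cores one physical core can carry if each is busy duty_cycle of the time."""
    return 1.0 / duty_cycle

print(f"oversubscription ratio: {ratio:.1f} virtual cores per physical core")
print(f"at 50% duty cycle: {supportable_vcores(0.5):.1f} vcores per core")
print(f"at 30% duty cycle: {supportable_vcores(0.3):.1f} vcores per core")
print(f"break-even duty cycle for this ratio: {1 / ratio:.0%}")
```

Under this model, a 2.5 ratio holds up as long as the average virtual core is busy no more than about 40% of the time, which sits inside the 30-50% range quoted above.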

We use 15 servers for our on-demand development VM’s. Unlike the CI systems, these VM’s are subject to heavy load, so we cannot oversubscribe the hardware to the same degree.

The next obvious cost is the electricity to power our servers, and of course the A/C costs to keep everything cool. Our electrical rate is about $0.17 per kWh, and we estimate the cooling cost at about 50% of the electrical cost.

Finally, we must consider the cost to maintain our 21 VM servers. To compute that amount, we must first determine how much of a sysadmin’s time will be spent managing these servers. Data from multiple sources shows that a sysadmin can maintain at least 100 servers, particularly if they are homogeneous as these are. Our servers therefore consume at most 21% of a sysadmin’s time.

Next, we have to determine the cost of the sysadmin’s time. I’m not privy to the actual numbers, but from what I can find, a top sysadmin in our area has a salary of about $90,000. The fully loaded cost of an employee is usually estimated at 2x the salary, for a total cost of $180,000 per year.

Here’s how it all adds up:

Continuous integration systems
  Hardware = $6,000.00 (6 dual quad-core systems at $3,000 each, amortized over 3 years)
  Personnel = $10,800.00 (6% of a fully-loaded sysadmin at $180,000)
  Electricity = $3,082.65 (6 systems x 345 W x 24 hours x 365 days x $0.17 per kWh)
  Cooling = $1,541.33 (50% of electricity cost)
  Subtotal = $21,423.98
Development systems
  Hardware = $15,000.00 (15 dual quad-core systems at $3,000 each, amortized over 3 years)
  Personnel = $27,000.00 (15% of a fully-loaded sysadmin at $180,000)
  Electricity = $7,706.61 (15 systems x 345 W x 24 hours x 365 days x $0.17 per kWh)
  Cooling = $3,853.31 (50% of electricity cost)
  Subtotal = $53,559.92
Total = $74,983.90
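The same breakdown can be written as a small Python model, which makes it easy to substitute your own server price, power draw or electrical rate. Treat it as a sketch under the assumptions stated in the text (3-year amortization, one sysadmin per 100 servers, cooling at 50% of electricity); the constant and function names are invented for this example.

```python
SERVER_PRICE = 3000          # dollars per dual quad-core server
AMORTIZATION_YEARS = 3
SYSADMIN_COST = 180_000      # fully loaded, dollars per year
SERVERS_PER_SYSADMIN = 100   # one sysadmin can maintain at least this many
SERVER_WATTS = 345
RATE_PER_KWH = 0.17
COOLING_FACTOR = 0.5         # cooling estimated at 50% of electricity

def annual_private_cost(servers: int) -> dict:
    """Yearly cost breakdown, in dollars, for a group of VM servers."""
    hardware = servers * SERVER_PRICE / AMORTIZATION_YEARS
    personnel = servers / SERVERS_PER_SYSADMIN * SYSADMIN_COST
    electricity = servers * SERVER_WATTS / 1000 * 24 * 365 * RATE_PER_KWH
    cooling = COOLING_FACTOR * electricity
    return {
        "hardware": hardware,
        "personnel": personnel,
        "electricity": electricity,
        "cooling": cooling,
        "total": hardware + personnel + electricity + cooling,
    }

ci = annual_private_cost(6)     # servers hosting the CI VM's
dev = annual_private_cost(15)   # servers hosting the development VM's

for name, costs in (("CI", ci), ("Development", dev)):
    print(f"{name}: ${costs['total']:,.2f}")
print(f"Total: ${ci['total'] + dev['total']:,.2f}")
```

The result can differ from the table by a cent or two, because the table rounds each line item before summing.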

Why is the public cloud so expensive?

I wasn’t surprised that the public cloud was more expensive, but I was surprised that it was that much more expensive. I had to figure out why it was so, and I think it comes down to two factors. First, we need 64-bit dual-core VM’s for our tests, but 64-bit support is only available on Large or better instances, which are at least 2x the cost of Medium instances. We would be forced to pay for more (virtual) hardware than we need.

Second, we benefit significantly by oversubscribing the hardware in our private cloud with 2.5 virtual cores per physical core. I have no doubt that Amazon is doing the same thing behind the scenes, but — and this is the real kicker — virtual cores in the public cloud are priced assuming a one-to-one virtual-to-physical ratio. Put another way, even though the public cloud provider is certainly oversubscribing their hardware and you’re only getting a fraction of a physical core for each virtual core, you still have to pay full price for those virtual cores. For all that increased hardware utilization is touted as a benefit of cloud computing, it only applies if you own the hardware.

Does it ever make sense to use the public cloud?

The results here are pretty dismal, but I think there are situations where the public cloud is the best choice. First, although private is cheaper in the long term, it requires a substantial upfront investment just to get off the ground — $63,000 for the hardware in our case. You may not have that kind of capital to work with.

Second, if your needs truly are “bursty”, the public cloud on-demand pricing is actually pretty competitive. Of course, you have to be really good about managing those VM’s — if you leave them powered on but idle, you still pay usage fees, which will quickly inflate your expenses.

Finally, if you’re just “testing the waters” to see if cloud computing will work for you, it’s definitely cheaper and easier to do that with a public cloud.

Private clouds for dev/test

Our private cloud has been a powerful enabling technology for my team. If you’re in a similar situation, you should seriously consider private versus public. You might be surprised to see how favorably the private cloud compares.

Cloud computing for traditional dev/test

Cloud computing has been all the rage lately. But most of the attention has focused on deployment of applications in the cloud, or at best, development of applications for the cloud. I haven’t seen much discussion about the ways that cloud computing can support development of “traditional” software — all that stuff that is not destined for cloud deployment.

Over the past two years, my engineering team has gradually migrated from a large collection of physical servers to a private development cloud, which has enabled us to support a rapidly increasing matrix of platforms and also improved development efficiency and developer happiness. I thought I’d share our experiences.

The bad old days

Two years ago, my development team had a server room stuffed full of rack-mounted computers — literally hundreds of 1U systems. At one point we determined that we had about 40 computers per developer. Seems outrageous, right? But we develop cluster-based software, so for a full system test (involving all major components) a developer needs at least three machines, and often 10 or more. And that was just for one developer, working on one platform. Consider that we have ten development and QA engineers, and that we support over 20 platforms (different flavors/versions of Windows and Linux), and you can see how quickly it adds up, even accounting for systems set up to dual- (or triple-, or quadruple-) boot.

This arrangement was functional, but just barely. The server closet was a nightmare of network, power and KVM cables. We had to retrofit it twice, once to bring in more power, and again to bring in more cooling. Maintaining the systems was a full-time job and then some: keeping everything up-to-date on patches, replacing dead or too-small disk drives, protecting against viruses. And just imagine the nightmare when a new OS came out — start with a cluster of machines configured to dual-boot XP and Server 2003, and then you want to add Server 2008 to the mix. First you have to repartition the drive, assuming it’s even big enough to accommodate all three. Then you have to reinstall the original two OS’s, and finally you can install the new OS. Multiply that by the number of machines and you’re looking at days or weeks of effort. Even if you use something like Ghost, inevitably you have a hodgepodge of hardware configurations, so you need to make multiple images.

And even with all the systems we had, we never seemed to have enough. Or rather, never enough of the right kind — when I needed to test on Windows, we only had Linux hosts available, or when I needed a multi-core system (which made up only a fraction of our total), I found they were all in use by my coworkers. Ironically, though we had hundreds of systems, most sat idle much of the time.

We had reached the point of crisis: we couldn’t squeeze any more systems into our server closet, nor any more operating systems on the systems we had. Something had to change.

So we got rid of all our servers.

Creating a private development cloud

Well, OK, not all of them. Actually, we replaced our cornucopia of cheap computers with a couple dozen beefy servers. Thanks to advances in hardware, we were able to get inexpensive 2U systems with 8 cores (dual-quads), a boatload of memory and large, fast disks. Then we put VMware’s ESX Server on them, and started using virtual machines instead of physical for the bulk of our development and testing needs. We didn’t realize it at the time, but we had created a private development cloud.

This approach has a lot of advantages over physical systems, which will be familiar to anybody who’s followed the cloud computing trend:

  • Increased utilization: each VM server hosts 10-12 virtual machines; although many VM’s are idle at any given time, others are not, ensuring that there is at least some load on each of the physical cores that we do have.
  • Greater flexibility: each “slot” in our virtual infrastructure can host a VM of whatever flavor we need. It doesn’t matter if there are 10, 20 or 100 Linux VM’s already deployed by other developers: if I need a Linux system, I can get it.
  • Elasticity: I can grow and shrink my virtual cluster as needed, at the touch of a button. I no longer need to haggle with my coworkers for resources, or wait patiently for somebody to finish their tests.
  • Self-service and ease of use: adding a new test system in our old infrastructure was a major chore: requisition hardware, get IT involved to find a place to rack it and plug it in and install the OS or OSes. Best case scenario: days from the time I determine I need a new system to the time I can use it. With our private cloud, it’s literally as easy as visiting a web page, choosing the OS, the number of cores and amount of RAM and clicking a button. Ten minutes later I’m ready for business.
  • Reduced IT costs: instead of managing hundreds of computers, our IT department only maintains about 20 VM servers (which are all identical), and about 20 VM “templates” from which we create any number of VM instances. If a VM goes bad for any reason, we just discard it and regenerate from the template; nobody wastes time trying to “fix” a broken VM. Adding support for a new OS is dramatically easier: set up a single VM template with the new OS and publish it for use.

Lessons learned

Although things are working pretty well now, we had our share of difficulties in the transition. We didn’t have anybody in house with any particular experience with ESX Server, so there was a learning curve for that. One particular problem we had was figuring out how much disk space to allocate to each ESX server — we foolishly tried to lowball that axis, and we paid for that mistake with VM server downtime (and thus reduced cloud capacity) each time we realized we still had not allocated enough space. In short: get as much disk as you possibly can.

Another lesson learned was to avoid using the “Undeploy and save state” feature of ESX Server. That’s conceptually similar to suspending a system, versus powering it down, and it chews up storage space on the ESX Server, often for no good reason. We also learned to avoid making clones of templates when deploying VM instances, again because it chews up storage space.

We also found that putting “too many” VM templates on a single disk partition caused significant filesystem lock contention, so we had to do some trial-and-error experimentation to find the “magic number” of templates per partition (it’s about 10, by the way).

Finally, we’ve found that although virtual machines fill the majority of our needs, we still need some physical machines, particularly for performance testing. Virtual machines are terrible for performance testing, first because it’s difficult to control the entire environment while running tests: the so-called noisy neighbors problem. Second, performance analysis, already an often arduous task, becomes nearly impossible with the extra complexity that virtualization introduces: not only do you need to be mindful of what’s happening on your VM, you must be aware of what’s happening on the VM server, and possibly what’s happening on other VM’s hosted on the same server.

Cloud computing for traditional dev/test

It was a bit of a rocky road to get where we are today, but I can say with confidence now that we absolutely made the right decision. Cloud computing is not just for scaling massive web applications: it is just as useful in traditional software development and test environments.

I wish I could quantify the positive impact on our development with data on improved quality or efficiency or reduced development time. I can’t. But I can make some concrete statements about the benefits we’ve enjoyed:

  • Most importantly, without our private cloud we would have been unable to grow our support matrix to the 20+ platforms it includes today.
  • Second, we reduced our IT cost by more than 5x, by reducing the number of systems that IT manages from (at least) 115 to just 21.
  • Finally, we cut our electrical bill by nearly 5x, from about $50,000 per year (115 physical servers, 300 watt power supplies, running 24/7, at a cost of $0.17 per kWh), to about $11,000 per year (21 VM servers, 345 watt power supplies). Likewise, we reduced our cooling costs from about $25,000 per year to about $5,400 per year.
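The electricity and cooling figures in the last bullet can be rechecked with a short Python sketch. It assumes every server draws its full rated wattage around the clock and that cooling runs at 50% of the electrical cost; the function name is made up for this example.

```python
RATE_PER_KWH = 0.17
COOLING_FACTOR = 0.5   # cooling estimated at 50% of electricity

def annual_power_cost(servers: int, watts: float) -> float:
    """Yearly electricity cost in dollars for servers running 24/7 at rated wattage."""
    kwh_per_year = servers * watts / 1000 * 24 * 365
    return kwh_per_year * RATE_PER_KWH

before = annual_power_cost(115, 300)   # old fleet of physical servers
after = annual_power_cost(21, 345)     # VM servers in the private cloud

print(f"electricity before: ${before:,.0f}, after: ${after:,.0f}")
print(f"cooling before: ${before * COOLING_FACTOR:,.0f}, after: ${after * COOLING_FACTOR:,.0f}")
print(f"reduction: {before / after:.1f}x")
```

With these inputs the electricity bill drops from roughly $51,400 to roughly $10,800 per year, a ratio a little under 5x.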

Beyond that, all I have is anecdotal evidence and the assurances of my teammates that “things are way better now.” For my part, the fact that I no longer have to arm wrestle my coworkers for access to resources makes it all worthwhile.