Pete Cheslock

DevOps, RelEng, DevTools, Automation, Randomness

Creating a Culture of Cost

When running a full physical infrastructure, the idea of cost is one that comes up during the procurement process: calculating the needs of the business and the expected growth, and purchasing systems with enough advance notice that you will be able to meet those expected demands. I’ve been out of the physical infrastructure world for over 3 years now, but in the past we would purchase hardware and depreciate it over a 3-year period. If you were a particularly analytical person you could easily determine the per-month, per-day, and per-hour cost of your infrastructure, but largely it would not matter much, as the money has already been spent. When it comes time to add new services or applications, you only need to determine whether the existing systems have the available capacity (memory, CPU, disk IO) to support that application. If your existing systems can handle the load from the new app, then the cost of that application is essentially zero (not counting admin overhead).
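As a rough illustration of that per-month, per-day, and per-hour arithmetic, here is a minimal sketch. It assumes simple straight-line depreciation with no residual value, 365-day years, and a made-up purchase price; the function name is hypothetical and real accounting rules may differ.

```python
def depreciated_cost(purchase_price, years=3):
    """Spread a hardware purchase evenly over its depreciation period.

    Assumes straight-line depreciation, no residual value,
    and 365-day years (an illustrative simplification).
    """
    months = years * 12
    days = years * 365
    hours = days * 24
    return {
        "per_month": purchase_price / months,
        "per_day": purchase_price / days,
        "per_hour": purchase_price / hours,
    }


# Hypothetical example: a server bought for $26,280, depreciated over 3 years.
costs = depreciated_cost(26280.0)
print(costs)
```

With those assumed numbers the server works out to $1 per hour, which is the kind of figure you can compare directly against an hourly cloud instance price.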

When building and running an entirely cloud-based infrastructure, you can make similar calculations as you would in the physical world. The premise is the same: if you have existing systems running on AWS (for example) that have excess capacity, you can add additional applications for essentially zero cost, since you are already paying for that running instance. The difference I find in the AWS world is that your developers, operations team, and even your support team may all require instances for testing or for other short-term technical needs. The process for new system acquisition is often just a command talking to an API, which provisions your instance. The beauty of public clouds is that you can provision new instances quickly to satisfy Development or Support needs. In the physical world, new server acquisition would likely go through multiple levels of approval. The largest part of cloud cost containment is keeping a watchful eye on new instance usage to ensure you don’t have wasted or excess capacity.

In the past at Sonian, we had not invested much in visibility or process around managing our overall costs on AWS (and other public clouds). This resulted in a stratospheric AWS bill that did not represent our current capacity needs. To battle this, we created a small team of people, including myself, our VP of Eng, and our head of Support Ops, to review our usage and cost. Since at the time there were not a lot of great apps that could help us correlate our instances across our multiple AWS accounts, we decided to build our own. The initial iteration simply correlated instances in AWS with our nodes in the Opscode Chef platform. Since we had heavy Chef integration, a node that was on AWS but not registered with Chef was likely a node that had failed and/or could be terminated. We continued to improve and build more features into our cost tracking application in order to find more unused or underutilized instances.
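The core of that first iteration can be sketched as a set difference. This is not our actual application, just a minimal illustration: it assumes you have already pulled a list of instance IDs from each AWS account and a list of the instance IDs that Chef knows about, and the function name and sample IDs are hypothetical.

```python
def find_unregistered_instances(aws_instance_ids, chef_instance_ids):
    """Return instance IDs that are running on AWS but unknown to Chef.

    Both arguments are plain iterables of instance ID strings; how they
    are fetched (AWS API, Chef search, etc.) is outside this sketch.
    """
    return sorted(set(aws_instance_ids) - set(chef_instance_ids))


# Hypothetical data: three instances on AWS, two registered with Chef.
aws_ids = ["i-0a1", "i-0b2", "i-0c3"]
chef_ids = ["i-0a1", "i-0c3"]

for instance_id in find_unregistered_instances(aws_ids, chef_ids):
    print("candidate for review/termination:", instance_id)
```

Anything the function returns is only a candidate: someone still has to confirm the instance is really orphaned before terminating it.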

Over the next few months, we continued to review our usage, cleaned up and removed unused instances and disks, and even redesigned our reference architecture, taking input from all areas of development and operations. Just a few months after assigning ownership and accountability for managing our AWS costs, we saw our AWS bill decrease by over 50%. And at the scale we deal with, that equates to massive cost savings that have benefited the entire company.

Since then and even up to today, the precision and ownership of managing our cost containment has trickled down to all parts of our Engineering and Operations teams. When discussing new features or changes to our systems or code, you will often hear our software developers discuss the per-hour pricing of m1.large nodes and adjust their code to use excess capacity we may have on m1.medium nodes (for example). This has carried over to our usage of our metrics hosting provider, Librato. Since we pay per metric data point, we keep a close eye on the metrics we are tracking, ensuring they have meaning to us and that they are actionable. Personally, I believe this keeps the quality of the data we are tracking high while keeping our costs low. If we were in a physical world (or even ran a Graphite setup on AWS) we would still have the same idea of a cost per metric stored, but with a slightly different cost basis/model.

In order to create a culture of cost, you must have ownership and accountability for your cloud assets. Just because you are pushing infrastructure to an IaaS provider does not mean you can ignore managing those assets. If you are struggling with a massive AWS bill you have just received, maybe you should ask the question, “Who owns this?” Accountability and ownership are key to the success of any project, and this is no different.

But for us, the proof is in the AWS bill. We were able to decrease our costs by over 50% by applying ownership. We have been able to hold our AWS costs steady while increasing our data storage and object counts by 300%. Ownership leads to people taking the time to understand reserved instance costs, spot node usage scenarios, and other AWS cost-saving features. It also ensures that everyone at the company, including Devs, Ops, Support, PM, and even our Sales team, understands our costs and usage of public IaaS providers.