ericsysmin's DevOps Blog

How to cloud!

I need to get something off my mind. A lot of people out there, in many organizations, still seem to think that “cloud” is just a buzzword, and that cloud migrations will likely fail.

I am here to tell you it’s very possible, very doable, and could actually work better than your current system. First I need to make sure everyone understands the cloud. Odds are that once you have migrated, if you follow these rules, your app will perform and run better.

  • Cloud migrations aren’t just an Operations thing
  • Plan for everything to fail
  • Centralize everything
  • Databases and other resources that don’t scale require clustering
  • What are your pets, and what are your cattle?
  • Load Test

These topics are very important, and if you skip them you will have serious problems down the road. I’ve run enough applications in the cloud to have seen what happens when you do.

Cloud migrations aren’t just an Operations thing

For some reason many companies and individuals believe that moving to the cloud is an operations thing. It’s not just an ops thing, and I am here to convince you why. When we write an application we expect our dependencies to work and stay working; we don’t expect them to fail. Here is where we have a problem. They will fail. Dependencies always fail, and that 3rd party API you talk to will probably go down at some point. What happens when it does? I’ve seen applications literally crash because they can’t connect to another resource. When I say crash, I mean hard crash, needing a restart of Tomcat, Java, Apache, Nginx, whatever it is. IMO, it’s the dumbest thing that can happen.

How do we attack this problem? First we talk to the developers. Does the 3rd party resource provide an “I got your message” acknowledgement, followed by a “now here is your response”? If a call fails (we can’t connect, or we get a server error), let’s try sending it again (fail quickly, retry often). Now if that resource still fails, instead of crashing the entire app, what if we log the error and allow the rest of the application to function correctly? If the resource is unavailable it’s OK to tell a user, “hey, this action has failed and has been logged, operations has been notified”. If you centralize your logging and monitoring, hopefully you have your logs sending alerts when failures go above a certain percentage. Splunk and Elasticsearch are great tools for this.
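Here is a minimal sketch of the “fail quickly, retry often, then degrade instead of crashing” idea. The names call_with_retries and flaky_maps_lookup are made up for illustration, and a real application would catch narrower exception types and probably back off between attempts.

```python
import logging
import time

log = logging.getLogger("dependency")


def call_with_retries(call, attempts=3, delay_seconds=0.1, fallback=None):
    """Try `call` up to `attempts` times; log and return `fallback` if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception as exc:  # assumption: a real app catches narrower errors
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay_seconds)
    log.error("dependency unavailable after %d attempts", attempts)
    return fallback


def flaky_maps_lookup():
    """Hypothetical third-party call that is currently down."""
    raise ConnectionError("maps API unreachable")


# The rest of the app keeps running; only this feature is degraded.
result = call_with_retries(flaky_maps_lookup, attempts=3, delay_seconds=0)
message = result or "This action failed and has been logged; operations has been notified."
```

The point is that the failure ends up as a log line and a friendly message, not as a dead Tomcat.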

Every company I have worked in has this issue. It’s just an oversight in development. Developers normally expect things to work; in the cloud you need to expect that they won’t always work. It’s a complete change of mindset, but in the end it will give you a better performing service even outside of the cloud. Remember, you aren’t down if it’s not yours. It’s easier to say “hey, the Google Maps API is down, we already contacted them and they are working on it” than to say “Google Maps is down, and our app is entirely down because of it”. People can deal with a feature being unavailable much better than with your entire application being offline.

On a side note from experience, I’ve seen a minor dependency of an Authentication API crash the entire Authentication API, which then crashed over 150 applications. That dependency was used by barely 5% of the Authentication API’s users. Cascading outages are common, and little issues can blow up into big ones without better error handling.

Plan for everything to fail

I will tell you now, you will have servers fail. They will die. They will die a horrible, horrible death, one in which you could lose everything on those servers. You can try to save them, but sometimes there’s nothing you can do. Because of this, the wonderful people at Netflix years ago created a tool called Chaos Monkey, which deliberately kills instances so you can see how your system copes. Related tools from Netflix go further and mimic the failure of an Availability Zone; hell, you could have them bring down a region if you wanted to.

You need to monitor everything. When I say everything, I mean all of your servers, the health status of all of your applications, and your application dependencies (third party APIs, etc.). If you are only in AWS, I’d recommend CloudWatch. It works great, you can autoscale and set policies, and it integrates with SNS, which can notify you on errors and on actions performed by your hopefully future “automated cloud”. If you must use another tool, learn about your tool’s APIs. Monitoring tools nowadays have RESTful APIs. If your tool does not, complain to the creators, because they are way behind.

Your servers will fail, it’s a fact. You need to have a way to recreate them, and I mean all of them. Don’t build an application without a way to rebuild a server. This is where Configuration Management is extremely valuable. Ansible, Salt, Puppet, Chef, CFEngine, it doesn’t matter, as long as you can use it to rebuild servers from scratch. Image-based systems work too. For example, for AWS Autoscaling all you do is deploy your app to an instance, then build an image from that instance, which Autoscaling then uses to create the new hosts, all of which are exactly the same.
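To make “alert when failures go above a percentage” concrete, here is a rough sketch of a sliding-window failure-rate check. FailureRateMonitor is a made-up name, and the window size and threshold are arbitrary assumptions; in practice a tool like CloudWatch would do this for you.

```python
from collections import deque


class FailureRateMonitor:
    """Track the last `window` results and flag when failures exceed a threshold."""

    def __init__(self, window=100, threshold=0.05):
        self.results = deque(maxlen=window)  # old results fall off automatically
        self.threshold = threshold

    def record(self, ok):
        self.results.append(bool(ok))

    def failure_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self):
        return self.failure_rate() > self.threshold


# Hypothetical health-check results for one dependency: 7 successes, 3 failures.
monitor = FailureRateMonitor(window=10, threshold=0.2)
for ok in [True] * 7 + [False] * 3:
    monitor.record(ok)

alert_needed = monitor.should_alert()
```

Whatever sends the notification (SNS, email, a pager) would hang off should_alert().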

Centralize Everything

Look, I know how you love logging into servers and doing that tail -f /var/log/myapp/my.log stuff, but we need to move on. Centralize your logging. What would you do if that server died and you couldn’t connect to it anymore? Those valuable logs would be lost. They are gone. 🙁 Now your analytics team has bad data, and you could have missed some errors that occurred. So centralize your logging, it just works better. If your logging is slow, scale it correctly so it’s not; of course you may have to tune your agents or tune Syslog on the server. There are tons of books on it, and you can really do magic with that data when it’s centralized.
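As one possible illustration, this sketch uses Python’s standard logging module to ship records to a syslog collector over UDP. The function name build_central_logger is my own, and 127.0.0.1 is only a stand-in for wherever your central collector actually lives.

```python
import logging
import logging.handlers


def build_central_logger(name, host, port=514):
    """Return a logger that ships records to a remote syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    already = any(
        isinstance(h, logging.handlers.SysLogHandler) for h in logger.handlers
    )
    if not already:  # avoid attaching a second handler on repeated calls
        handler = logging.handlers.SysLogHandler(address=(host, port))
        handler.setFormatter(
            logging.Formatter("%(name)s: %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Assumption: "127.0.0.1" stands in for a hypothetical central log collector.
logger = build_central_logger("myapp", "127.0.0.1")
logger.info("order service started")
```

If the server dies five minutes later, that line is already somewhere else.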

It’s also important to run your deployments from a single location. I have no idea why there are people out there who still think deploying from your desktop is acceptable. You basically make that person’s computer a fail-point in the deployment process. How does it feel to tell the Product Owner that they can’t deploy the application to production because the Ops guy’s computer is down? Create a jump server and cluster it, or create 2 jump servers, and make sure the scripts are replicated, by a Git pull on both servers or something similar. If one dies, you can still do your deployment, and if the ops guy is fired, the process can easily continue. This also helps in securing your environment, because you can make it so only those two deployment servers have access to your servers and infrastructure. It is also a huge thing for PCI and security auditing, and will save you a little embarrassment with the security team or your auditor.

Configuration Management, when centralized, also helps. It will help manage those MySQL servers, etc., and keep access and everything else consistent. Your autoscaled “cattle” servers don’t need to use centralized configuration management. Actually, it’s a terrible idea to use centralized configuration management on your “cattle”. It will lead to complications with scaling, and with authenticating your “cattle” to your centralized configuration management. But you can use configuration management to build the initial or base image used for your “cattle”.

Centralize your scripts. Put the scripts that are used for deployments into repositories. Subversion, Mercurial, Git, CVS, GitHub, any of them work. This helps you know where to find your scripts instead of looking around on your servers, and it also lets you do peer review on committed code and changes to those scripts.

Databases and other resources that don’t scale require clustering

We all have databases, we love databases, the only problem is they need to be taken care of. In the cloud, unfortunately, we can’t guarantee that the server hosting that database will be up. Luckily, people have already come up with solutions to that problem. We can now cluster applications, and do master-master and master-slave setups. <<< DON’T SKIP THIS. In AWS we can use special instances that have 10GbE links to handle fast replication of databases if needed.

Some servers and monitoring tools offer clustering. Use it. Pacemaker and Corosync are great tools to help. You can cluster Nginx and Apache servers using Pacemaker and Corosync.

One of these servers may go down, and to save you the panic, it’s always great to have another instantly take charge.

What are your pets, and what are your cattle?

Pets are the servers you need to take care of; cattle are all the servers that are the same, usually serving up your app. You need to identify these. What servers can you treat like cattle? The more cattle you have, the easier it is to scale them and grow. Pets require more time, and not many of us get the benefit of time. For some reason everyone seems understaffed.

How do you determine this? Well, I can make it easy at least for some of your servers. If you have a database, for the most part those are pets. However, times are changing; maybe someone has a project out there that allows you to simply change a number and boom, you have another replica of your DB. Then there are monitoring tools like Zabbix, where unfortunately you can’t have duplicate servers running at the same time, so you need to cluster it. That’s considered a pet. You need to take care of it and maintain it. Any kind of server that you can’t treat like cattle is a pet. An easy way to do this is to write all the components down, go through each one, and ask “can I treat it like cattle?”

Load Test

For some reason people think, “hey, let’s just choose this size for our instances,” and that’s what they use. I don’t know where this idea came from, but it’s the most ridiculous and inefficient way to develop an application. So what’s the proper way to choose an instance type/size? Well, we test, and test, and test. While the app is being developed and tested, try different instance sizes. A Java application that you think needs a lot of memory might actually need more CPU because of garbage collection. This is stuff you’d only find out if you load test your boxes. Work with your QA team to do these tests. Choosing an instance based on what you believe will be right is not efficient, because it’s untested.

Also, figure out your minimum requirements during load testing. Do you need 8GB of RAM on a server? How does it run with 4GB? Pick a good size you can stick with. Vertical scaling is very difficult to manage compared to horizontal scaling. Horizontal scaling can literally be done by changing a number, while vertical scaling requires a shutdown and restart of your servers. Load testing also gives you a solid measurement of what your devs’ test environments should look like. Oftentimes developers will ask “can we add more RAM?” or “can we add more CPU?” Don’t just do it for them. There needs to be a serious reason to change.

When you are managing cost effectiveness in the cloud, you need to understand that if you can guarantee you will use a single type of instance for your application, you can save 50% to 60% on instance costs with reserved instances. Changing the instance size makes it impossible to plan for this in your budget, so you give up that 50% to 60% saving on AWS.

Also, use math to calculate total traffic and users on your site. With a few servers, see how much traffic they can support. Then talk to the teams to find out how much traffic is expected in staging, and how much is expected in production. Use math to figure out how many servers you will need to support that traffic. Later, when you are in prod and have some good baseline metrics, you can create scaling mechanisms to handle growth and decreases in traffic. Scaling will help you keep costs down.
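The “use math” part can be as simple as the sketch below. The traffic numbers, the 25% headroom and the function names are all assumptions for illustration; plug in what your own load tests measured.

```python
import math


def servers_needed(expected_rps, rps_per_server, headroom=0.25):
    """Servers required for `expected_rps`, keeping `headroom` spare capacity."""
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    return math.ceil(expected_rps * (1 + headroom) / rps_per_server)


def reserved_cost(on_demand_cost, discount):
    """Cost after a reserved-instance discount given as a fraction (0.5 = 50%)."""
    return on_demand_cost * (1 - discount)


# Hypothetical load-test result: one server handles about 300 requests/second.
production = servers_needed(expected_rps=2000, rps_per_server=300)
staging = servers_needed(expected_rps=200, rps_per_server=300)

# Hypothetical monthly bill of 1000 at on-demand prices, with a 50% discount.
monthly_reserved = reserved_cost(on_demand_cost=1000, discount=0.5)
```

The headroom factor is a judgment call; it is there so that losing one server does not immediately push the rest over their tested limit.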


There is a lot to making cloud migrations, and cloud infrastructure work well. But the results can be amazing, and with APIs and automation you can manage massive infrastructures with minimal effort and support. I am sure there is more I can add, these were just key points I wanted to make.


Automate Everything

Ok, almost automate everything. I do recognize that this far into DevOps there are some things you can’t quite automate. But for the most part you can stick to these few rules. If your situation matches any of these, you could probably automate it with a script.

  • The same actions are performed often, more than 3 times.
  • You want to outsource advanced actions to a tier 1 team.
  • You want to allow developers or non-technical people to perform the actions.
  • You want a way to make sure the same actions are done whenever something is run.
  • Source control has a huge benefit, I script just about everything, just so I can upload it to source control and roll back a script if there’s a bad change.

Some languages that are really common in DevOps land include Bash, Python, PowerShell (my Windows friends), and Ruby. I know of DevOpsers using Java, Perl, Go, and many other languages, but over 6 years and 3 companies, Bash, Python, PowerShell, and Ruby are the only ones I’ve seen used in the companies I worked for.


Pets vs. Cattle

Earlier I mentioned Pets vs Cattle, and a friend recently asked me what it actually means. Pets vs Cattle is a concept brought on by the advent of cloud computing and the management of massive infrastructures that can be easily automated. It’s extremely different from traditional systems administration, which was born from physical machines, where you may need to purchase new hardware, build it, and then deploy it within your datacenter.


First you gave the server a name. Then you installed services on it, and if the server got corrupted or its hardware failed you had to repair the machine, possibly rebuild the hard drive, fix corrupted filesystems, I could go on. Essentially you nursed it back to health. This is referred to as "taking care of a pet": you take care of it, you update, upgrade, and patch it. Some applications still need this attitude, but many in this decade do not! Now we've changed, and with RESTful APIs we can quickly spin up infrastructure and go from scratch to a working machine in minutes.


Now with cloud computing we are able to do amazing things. I can easily build a "golden image" from a running application, even build that image from scratch and use it in an autoscaled environment (something Netflix has done for years!). Using tools like Ansible, Puppet, Salt, and Chef, you can quickly build new infrastructure that is known to work. You can even use those tools to build a brand new image which can then be used to create 100 servers just with one command line entry. Instead of names these guys get numbers. All servers are essentially identical to each other. If a server dies, you issue a couple API calls (or not if you are using AWS Autoscaling or similar, as it does it for you if you want it to), and now you have replaced that server in your environment. If a cow is ill/dying you kill it, and get another. That's where this analogy originates.
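Here is a toy sketch of the “kill it and get another” loop. It only models the bookkeeping: reconcile and the web-NNN naming are made up for this example, and a real setup would issue cloud API calls (or let AWS Autoscaling do it) instead of editing a dictionary.

```python
import itertools


def reconcile(instances, desired_count, next_id):
    """Drop unhealthy instances and add numbered replacements up to desired_count.

    `instances` maps an instance name to True (healthy) or False (unhealthy).
    `next_id` is an iterator that yields unused instance numbers.
    """
    healthy = {name: True for name, ok in instances.items() if ok}
    while len(healthy) < desired_count:
        # Assumption: every replacement is launched from the same golden image.
        healthy[f"web-{next(next_id):03d}"] = True
    return healthy


# web-002 is ill, so it is removed and a fresh numbered server takes its place.
fleet = {"web-001": True, "web-002": False, "web-003": True}
fleet = reconcile(fleet, desired_count=3, next_id=itertools.count(4))
```

Nobody logs in to nurse web-002 back to health; the fleet simply ends up at the desired count again. The sketch only scales up to the target and never removes healthy servers.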

DevOps Really…..

What is DevOps?

DevOps is not one thing; it’s a plethora of ideas and practices that have revolutionized Operations and the Release Process. It’s a culture change. DevOps is taking development practices and applying them to operations, then using those methods to automate everything. It joins Operations and Development together. Consider it a hybrid; I mean, it is DevOps, it’s the joining of Development and Operations.

DevOps was created to empower developers to quickly integrate and build environments to test code, and to allow operations teams to deploy and monitor applications just as fast as the developers and QA teams can approve them, thus leading to Continuous Integration and Continuous Deployment. DevOps teams that I have started tend to do well with another great practice called Scrum, which is just as important to operations as it is to development. If you don’t believe me, try it for a month, then see how effective your team becomes.

As you may have noticed, many of these things are part of Agile practices. If you aren’t familiar with them, hopefully by reading and following this blog you will be. DevOps and Agile go hand in hand. DevOps helps you reach the goals of Agile effectively, and much more easily. You don’t want Operations to be a blocker. That looks bad…wait…it looks really bad.

Some extremely popular concepts come into play here.

  • Scrum
  • Operations Early
  • Infrastructure as Code
  • Automate Everything
  • Test Driven Development for Ops (CircleCI, TravisCI, etc.)
  • Integrated Operations
  • Pets vs. Cattle

Over the next few weeks I will cover all of these, and hopefully more. There’s just way too much to type up in one blog post, and all of it is important (I mean there’s entire books written about this stuff, there’s even series of books written about these practices).


Providing Static IP in front of AWS ELB (Elastic Load Balancer)

So, certain situations can arise (oftentimes security related) where you will need static IP addresses in front of your ELBs. Not everyone needs this, but a rising number of people are starting to, and I will show you how. The only place I’ve run into this use case is on AWS, not anywhere else. But if it works, it works.

Currently there are only two solutions I have found: HAProxy 1.6, and Nginx (supported in versions going back more than 3 years). Here’s how.

In Nginx:

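There is a configuration directive called “resolver”. It takes the address of one or more DNS servers and an optional valid= parameter that controls how long Nginx caches an answer. It can be used in your http, server, or location context.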

In use, setting valid=10s on the resolver forces Nginx to refresh its DNS answer every 10s. In applications I have done we set this to 2s, because it works better for the amount of scaling up and down we do. Since doing this we haven’t seen any failures related to DNS resolution of ELBs.

As a side note, putting it in the http context does have some issues, and you may end up with stale DNS data even with valid=10s, etc.
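To show what a short valid= window buys you, here is a small Python model of a resolver that caches an answer and looks it up again once the window has passed. This is not how Nginx is implemented; RefreshingResolver and its parameters are invented for the illustration.

```python
import socket
import time


class RefreshingResolver:
    """Cache a hostname's addresses and look them up again after `valid` seconds."""

    def __init__(self, lookup=None, valid=10.0, clock=time.monotonic):
        self.lookup = lookup or self._system_lookup
        self.valid = valid
        self.clock = clock
        self._cache = {}  # hostname -> (expiry time, addresses)

    @staticmethod
    def _system_lookup(hostname):
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def resolve(self, hostname):
        now = self.clock()
        cached = self._cache.get(hostname)
        if cached and cached[0] > now:
            return cached[1]  # still inside the valid window
        addresses = self.lookup(hostname)  # window expired: ask DNS again
        self._cache[hostname] = (now + self.valid, addresses)
        return addresses
```

An ELB’s addresses change as it scales, so the shorter the window, the less time a proxy spends sending traffic to an address that no longer belongs to the load balancer. The lookup and clock arguments are there only so the behaviour can be exercised without real DNS.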

