I need to get something off my mind. A lot of people, in many organizations, still seem to think that “cloud” is just a buzzword, and that cloud migrations will likely fail.
I am here to tell you it’s very possible, very doable, and could actually work better than your current system. First I need to make sure everyone understands the cloud. Odds are that once you have migrated, if you follow these rules, your app will perform and run better.
- Cloud migrations aren’t just an Operations thing
- Plan for everything to fail
- Centralize everything
- Databases and other resources that don’t scale require clustering
- What are your pets, and what are your cattle?
- Load Test
These topics are very important, and if you skip them you will have serious problems down the road. I’ve put enough applications in the cloud to have seen what happens when certain things get skipped.
Cloud migrations aren’t just an Operations thing
For some reason many companies and individuals believe that moving to the cloud is an operations thing. It’s not just an ops thing, and I am here to convince you why. When we write an application we expect our dependencies to work and stay working; we don’t expect them to fail. Here is where we have a problem. They will fail. Dependencies always fail, and that 3rd party API you talk to will probably go down at some point. What happens if it does? I’ve seen applications literally crash because they can’t connect to another resource. When I say crash, I mean hard crash, needing a restart of Tomcat, Java, Apache, Nginx, whatever it is. IMO, it’s the dumbest thing that can happen.
How do we attack this problem? First we talk to the developers. Does the 3rd party resource provide an “I got your message” followed by a “now here is your response”? If the call fails (we can’t connect, or we get a server error), let’s try sending it again (fail quickly, retry often). If the resource still fails, instead of crashing the entire app, what if we log the error and allow the rest of the application to keep functioning? If the resource is unavailable it’s OK to tell a user, “hey, this action has failed, and has been logged, operations has been notified”. If you centralize your logging and monitoring, you can have your logs send alerts when failures go above a certain percentage. Splunk and ElasticSearch are great tools for this.
Every company I have worked in has this issue. It’s just an oversight in development. Developers normally expect things to work; in the cloud you need to expect that they won’t always work. It’s a complete change of mindset, but in the end it will give you a better performing service, even outside of the cloud. Remember, you aren’t down if it’s not yours. It’s easier to say “hey, the Google Maps API is down, we already contacted them and they are working on it” than to say “Google Maps is down, and our app is entirely down because of it”. People can usually deal with a feature being unavailable far better than with your entire application being offline.
On a side note from experience, I’ve seen a minor dependency of an Authentication API crash the entire Authentication API, which then crashed over 150 applications. That dependency was used by barely 5% of the Authentication API’s total users. Cascading outages are common, and without better error handling little issues can blow up into big ones.
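The “fail quickly, retry often, then log and carry on” idea can be sketched roughly like this in Python. It is only a minimal illustration: `call_with_retry`, `flaky_geocode`, and the attempt and delay values are made-up names and numbers, and a real application would catch narrower exceptions and probably back off between retries.

```python
import logging
import time

logger = logging.getLogger("myapp")


def call_with_retry(func, attempts=3, delay_seconds=0.1, fallback=None):
    """Call func(); retry on failure, then log and return fallback instead of crashing."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:  # assumption: a real app would catch narrower errors
            logger.warning("dependency call failed (attempt %d/%d): %s",
                           attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: record it and let the rest of the app keep running.
    logger.error("dependency unavailable after %d attempts", attempts)
    return fallback


def flaky_geocode():
    """Hypothetical stand-in for a 3rd party API that is currently down."""
    raise ConnectionError("maps API unreachable")


result = call_with_retry(flaky_geocode, attempts=2, delay_seconds=0, fallback=None)
if result is None:
    print("This action has failed and has been logged; operations has been notified.")
```

The point is that the failure stays inside one feature: the caller gets a fallback value and decides what to show the user, and nothing needs a restart.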
Plan for everything to fail
I will tell you now, you will have servers fail. They will die. They will die a horrible, horrible death, one where you can lose everything on those servers. You can try to save them, but sometimes there’s nothing you can do. Because of this, the wonderful people at Netflix years ago created a tool called ChaosMonkey (http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html). This tool randomly kills instances so you find out how your services cope. Its siblings in Netflix’s Simian Army go further: Chaos Gorilla mimics the failure of an entire Availability Zone, and hell, Chaos Kong takes down a whole region.
You need to monitor everything. When I say everything, I mean all of your servers, all of your applications’ health statuses, and your application dependencies (third party APIs, etc.). If you are only in AWS, I’d recommend CloudWatch. It works great, you can Autoscale and set policies, and it integrates with SNS, which can notify you on errors and on actions performed by your hopefully future “automated cloud”. If you must use another tool, learn about your tool’s APIs. Monitoring tools nowadays have RESTful APIs. If your tool does not, complain to the creators, because they are way behind.
Your servers will fail, it’s a fact. You need to have a way to recreate them, and I mean all of them. Don’t build an application without a way to rebuild a server. This is where Configuration Management is extremely valuable. Ansible, Salt, Puppet, Chef, CFEngine, it doesn’t matter, as long as you can use it to rebuild servers from scratch. Image-based systems work too. For example, for AWS Autoscaling you deploy your app to an instance, then build an image from that instance, which Autoscaling then uses to create the new hosts, all of which are exactly the same.
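To make “monitor everything, and alert above a percentage” a bit more concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the check names, the lambdas standing in for real probes, and the 20% threshold. In practice a tool like CloudWatch does this job; the sketch only shows the logic.

```python
def run_health_checks(checks):
    """Run every check; a check that raises counts as a failure, not a crash."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failure_rate(results):
    """Fraction of checks that failed, between 0.0 and 1.0."""
    if not results:
        return 0.0
    failed = sum(1 for ok in results.values() if not ok)
    return failed / len(results)


def should_alert(results, threshold=0.2):
    """True when the share of failing checks is above the threshold."""
    return failure_rate(results) > threshold


def maps_api_probe():
    """Hypothetical probe of a third party API that is currently unreachable."""
    raise ConnectionError("maps API unreachable")


# Hypothetical inventory: servers, app health endpoints, and dependencies.
checks = {
    "web-1": lambda: True,
    "auth-api": lambda: True,
    "database": lambda: True,
    "maps-api": maps_api_probe,
}
results = run_health_checks(checks)
print(results, failure_rate(results), should_alert(results))
```

Note that the third party dependency sits in the same list as your own servers, so its outage shows up in the same place.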
Centralize everything
Look, I know how you love logging into servers and doing that tail -f /var/log/myapp/my.log stuff, but we need to move on. Centralize your logging. What would you do if that server died and you couldn’t connect to it anymore? Those valuable logs are lost. They are gone. 🙁 Now your analytics team has bad data, and you could have missed some errors that occurred. So centralize your logging, it just works better. If your logging is slow, scale it correctly so it’s not; you may have to tune your agents or tune Syslog on the server. There are tons of books on it, and you can really do magic with that data when it’s centralized.
It’s also important to run your deployments from a single location. I have no idea why there are people out there who still think deploying from your desktop is acceptable. You basically make that person’s computer a fail-point in the deployment process. How does it feel to tell the Product Owner that they can’t deploy the application to production because the Ops guy’s computer is down? Create a jump server and cluster it, or create 2 jump servers and make sure the scripts are replicated, for example by a Git pull on both servers. If one dies, you can still do your deployment, and if the ops guy is fired, the process can easily continue. This also helps in securing your environment, because you can make it so only those two deployment servers have access to your servers and infrastructure. That is a huge thing for PCI and security auditing, and will save you a little embarrassment in front of the security team or your auditor.
Configuration management, when centralized, also helps. It will help manage those MySQL servers and the like, and keep access and everything else consistent. Your autoscaled “cattle” servers, though, don’t need to talk to centralized configuration management. Actually, it’s a terrible idea to use centralized configuration management on your “cattle”. It complicates scaling, and you have to authenticate every one of your “cattle” to the central server. But you can use configuration management to build the initial image or base used for your “cattle”.
Centralize your scripts. Put the scripts that are used for deployments into repositories. Subversion, Mercurial, Git, CVS, GitHub, any of them work. This way you know where to find your scripts instead of looking around on your servers, and it also lets you do peer review on committed code and changes to those scripts.
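As a rough illustration of shipping logs off the box instead of tailing a local file, here is a Python sketch using the standard `logging` module. The `JsonFormatter` and `build_logger` names are made up for this example, and tagging each line with the host name is my own assumption about what makes centralized logs useful. The demo writes to an in-memory stream so it runs anywhere; in production you would hand `build_logger` a network handler such as `logging.handlers.SysLogHandler` pointed at your own log server.

```python
import io
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line tagged with the originating host."""

    def __init__(self, app_name):
        super().__init__()
        self.app_name = app_name
        self.host = socket.gethostname()

    def format(self, record):
        return json.dumps({
            "app": self.app_name,
            "host": self.host,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def build_logger(app_name, handler):
    """Attach a handler (local stream here, a network handler in production)."""
    handler.setFormatter(JsonFormatter(app_name))
    logger = logging.getLogger(app_name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


# Demo only: an in-memory stream stands in for the central log collector.
stream = io.StringIO()
log = build_logger("myapp", logging.StreamHandler(stream))
log.error("payment service unreachable")
print(stream.getvalue().strip())
```

Because every line carries the host and app name, the logs still tell you which server had the problem after that server is gone.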
Databases and other resources that don’t scale require clustering
We all have databases, we love databases; the only problem is they need to be taken care of. In the cloud, unfortunately, we can’t guarantee that the server hosting that database will stay up. Luckily, people have already come up with solutions to that problem. We can now cluster applications and do master-master and master-slave setups. <<< DON’T SKIP THIS. In AWS we can use special instances that have 10GbE links to handle fast replication of databases if needed.
Some servers and monitoring tools offer clustering. Use it. Pacemaker and Corosync are great tools to help. You can cluster Nginx and Apache servers using Pacemaker and Corosync.
One of these servers may go down, and to save you the panic, it’s always great to have another instantly take charge.
What are your pets, and what are your cattle?
Pets are the servers you need to take care of; cattle are all the servers that are the same, usually serving up your app. You need to identify these. What servers can you treat like cattle? The more cattle you have, the easier it is to scale them and grow. Pets require more time, and not many of us get the benefit of time. For some reason everyone seems understaffed.
How do you determine this? Well, I can make it easy, at least for some of your servers. If you have a database, for the most part it is a pet. However, times are changing; maybe someone has a project out there that lets you simply change a number and boom, you have another replica of your DB. Then there are monitoring tools like Zabbix, where unfortunately you can’t have duplicate servers running at the same time, so you need to cluster it. That’s a pet. You need to take care of it and maintain it. Any kind of server that you can’t treat like cattle is a pet. An easy way to do this is to write all the components down, go through each one, and ask “can I treat it like cattle?”
Load Test
For some reason people think, hey, let’s just choose this size for our instances, and that’s what they use. I don’t know where this idea came from, but it’s the most ridiculous and inefficient way to develop an application. So what’s the proper way to choose an instance type and size? We test, and test, and test. While the app is being developed and tested, try different instance sizes. A Java application that you think needs a lot of memory might actually need more CPU because of garbage collection. This is stuff you only find out if you load test your boxes. Work with your QA team to do these tests. Choosing an instance based on what you believe will be right is not efficient, because it’s untested.
Also figure out your minimum requirements during load testing. Do you need 8GB of RAM on a server? How does it run with 4GB? Pick a good size you can stick with. Vertical scaling is much harder to manage than horizontal scaling. Horizontal scaling can literally be done by changing a number, while vertical scaling requires a shutdown and restart of your servers. Load testing also gives you a solid measurement of what your devs’ test environments should look like. Developers will often ask “can we add more RAM?” or “can we add more CPU?”. Don’t just do it for them. There needs to be a serious reason to change.
When you are managing cost in the cloud, understand that if you can commit to a single instance type for your application, you can save 50% to 60% on instance costs with reserved instances. Changing the instance size makes that impossible to plan for in your budget, so you give up that 50% to 60% saving on AWS.
Also, use math to calculate the total traffic and users on your site. With a few servers, see how much traffic each can support. Then talk to the teams to find out how much traffic is expected in staging and how much in production. Use math to figure out how many servers you will need to support that traffic. Later, when you are in prod and have some good baseline metrics, you can create scaling mechanisms to handle growth and decreases in traffic. Scaling will help you keep costs down.
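The “use math” step is simple enough to write down. Here is a minimal Python sketch; the function name, the 30% headroom, the two-server minimum, and the traffic numbers are all made-up assumptions, and the per-server figure is whatever your own load tests measured.

```python
import math


def servers_needed(expected_rps, rps_per_server, headroom=0.3, minimum=2):
    """Servers required for expected_rps while keeping `headroom` spare capacity."""
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    # Only plan to use part of each server, so a traffic spike or a dead
    # instance does not immediately push the rest over the edge.
    usable_rps = rps_per_server * (1 - headroom)
    return max(minimum, math.ceil(expected_rps / usable_rps))


# Hypothetical numbers: load tests showed about 400 requests/second per
# server, and the teams expect 2500 in production and 100 in staging.
print("production:", servers_needed(2500, 400))
print("staging:", servers_needed(100, 400))
```

The minimum of two is there because one server is still a single point of failure, however little traffic staging gets.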
There is a lot to making cloud migrations and cloud infrastructure work well. But the results can be amazing, and with APIs and automation you can manage massive infrastructures with minimal effort and support. I am sure there is more I could add; these were just the key points I wanted to make.