While architecting your system, take a few example scenarios and check how your architecture is protected against them. Every component needs to be automatically replaceable; let's see how this plays out.
With every application server in a cluster, and strictly monitored by the Elastic Load Balancer (ELB), this is not a problem. At a predefined interval, the load balancer performs a small health check on each instance. When the check times out, the instance is terminated and replaced by a new one, launched from either a preconfigured AMI or an EBS snapshot.
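The replace-on-failure loop can be sketched as follows. This is a minimal local simulation, not real AWS API calls: the instance IDs, the health table, and the `check_health` function are invented for illustration.

```python
import itertools

# Toy health table; a real load balancer would probe an HTTP endpoint
# on each instance at a fixed interval instead.
HEALTH = {"i-001": True, "i-002": False, "i-003": True}
_fresh = itertools.count(100)

def check_health(instance_id):
    """Stand-in for the load balancer's periodic health check."""
    return HEALTH.get(instance_id, False)

def replace_unhealthy(cluster):
    """Drop instances that fail the check and launch replacements from
    the same image, keeping the cluster size constant."""
    survivors = [i for i in cluster if check_health(i)]
    while len(survivors) < len(cluster):
        new_id = f"i-{next(_fresh)}"
        HEALTH[new_id] = True  # a freshly launched instance comes up healthy
        survivors.append(new_id)
    return survivors

cluster = replace_unhealthy(["i-001", "i-002", "i-003"])
print(cluster)  # the failed i-002 is gone, replaced by a fresh instance
```

The point of the sketch is the invariant: after every pass, the cluster is back at full size and every member passes its health check.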
Within each region, there are multiple Availability Zones (AZs). You can think of an AZ as an isolated datacenter within the region: it operates as a standalone unit, so if one zone goes down, the others keep operating normally. Always run your architecture in at least two AZs. When a complete zone goes down, you can scale up the remaining zone to handle the load. Be careful, though: if you redirect traffic to the other zone without having enough computing power ready, it may collapse under the load. Make sure you have enough spare capacity in the surviving zone.
Every EC2 node uses Elastic Block Store (EBS), the service that provides flexible hard drive capabilities. You can easily create snapshots, create or destroy drives, and attach them to instances. Make sure your nodes are stateless and always have an up-to-date snapshot to launch new instances from. As your nodes are clustered and stateless, each drive is essentially the same, which makes them easy to replace. The state of your application (static files, database, uploaded media, etc.) is stored in S3 or replicated elsewhere.
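Keeping an up-to-date snapshot matters because a replacement node boots from it. A small sketch of the selection step, with invented snapshot IDs and timestamps; in a real setup this metadata would come from the EC2 API rather than a hard-coded list.

```python
from datetime import datetime

# Invented snapshot metadata; in practice this comes from the EC2 API.
snapshots = [
    {"id": "snap-a1", "started": datetime(2012, 7, 1, 3, 0)},
    {"id": "snap-b2", "started": datetime(2012, 7, 2, 3, 0)},
    {"id": "snap-c3", "started": datetime(2012, 7, 1, 15, 0)},
]

def latest_snapshot(snaps):
    """Pick the most recent snapshot to launch a fresh node from."""
    return max(snaps, key=lambda s: s["started"])

print(latest_snapshot(snapshots)["id"])  # snap-b2
```

Because the nodes are stateless, "most recent" is all the logic you need; there is no per-node state to reconcile.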
It may happen that one of your nodes gets compromised and someone infiltrates your architecture. At this point you are extremely vulnerable. Fortunately, you can monitor suspicious activity as it happens. Nodes have repetitive behaviour, which makes it easy to spot a node that suddenly requests data it normally never needs. Be sure to use the built-in IAM (Identity and Access Management) service for instance permissions; this blocks a node from accessing anything other than its own data. When a node is behaving oddly, you can do one of two things. Terminate it and launch a new one, risking that the attacker simply gets back in the same way. Or create a small honeypot VPC and move the node into it. With no permissions to access anything outside the VPC, the node can be monitored safely. You can then check the log files, see how the attacker entered your architecture, and fix the hole.
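An IAM policy along these lines is what locks a node down to its own data. The bucket name, prefix, and instance ID below are placeholders; the policy is shown as the Python dict you would serialise to JSON and attach to the node's role.

```python
import json

# Hypothetical bucket and key prefix: the node's role may only read and
# write objects under its own prefix, nothing else.
node_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-app-bucket/node-data/i-001/*",
        }
    ],
}

print(json.dumps(node_policy, indent=2))
```

With no wildcard on the bucket root, a compromised node cannot enumerate or read other nodes' data, which is exactly the containment the paragraph describes.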
Speaking of a worst-case scenario: a complete region, with all its Availability Zones, may go down. This happened recently when a storm hit the biggest AWS US region, taking a big part of the internet down with it. With Route 53 you can easily turn on latency-based routing, which locates the nearest operating region and redirects requests to it. DNS records have a minimum time-to-live (TTL) of 60 seconds, provided your users' internet service providers don't cache them for longer. You MUST always save the state of your application in another region, unless your application already runs in multiple regions. It may take a few minutes to re-launch your architecture from the dump files, but that's better than hours of downtime.
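Latency-based routing boils down to: among the regions that pass their health checks, answer the DNS query with the one closest, in measured latency, to the client. A toy version of that decision, with invented latency numbers (Route 53 maintains real measurements like these internally):

```python
# Invented per-region latencies (ms) as seen from one client's resolver.
latency_ms = {"us-east-1": 32, "eu-west-1": 110, "ap-southeast-1": 240}
# us-east-1 is the downed region in this scenario.
healthy = {"us-east-1": False, "eu-west-1": True, "ap-southeast-1": True}

def pick_region(latencies, health):
    """Return the lowest-latency region that passes its health check,
    i.e. what a latency-based DNS record would resolve to."""
    candidates = [r for r in latencies if health.get(r)]
    return min(candidates, key=lambda r: latencies[r])

print(pick_region(latency_ms, healthy))  # eu-west-1
```

Note that the nearest region (us-east-1) is skipped because it fails its health check; traffic falls through to the next-best one, which is exactly the failover behaviour described above.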