Tuesday, May 3, 2011

Using Amazon Web Services features to improve EC2 and EBS resource durability

I run the Naval Reactors History Database, a hobby project, on Amazon Web Services resources. This includes a Linux server, persistent storage, and an IP address. I run them out of the AWS US East region, which was the region at the center of the recent and significant AWS outage. While my online resource never went down, to the best of my knowledge, it clearly could have been affected, because the EBS control plane serves Availability Zones across the entire region. Also, Amazon announced that a small amount of EBS volume data was lost in the affected Availability Zone.

So, this weekend I spent some time thinking about preserving the work I've done on my Naval Reactors project. The context: an online database that I'm slowly building, with objects added and updated as I find time on weekends and evenings. In short, it's a fairly static resource. In his book on Amazon Web Services, Jeff Barr notes the importance of creating lists. That's what I hope to get out of my own work here - a set of lists that I create and can use to recover from AWS outages like the one that occurred last month.

So, to begin. First scenario: I'm running an m1.small Linux instance in the us-east-1d Availability Zone (AZ). I can launch and test a copy of my current server in another us-east-1 AZ. All of this work is done in the AWS Management Console, so it's quite quick and easy:

1. From Instances: Create an AMI from the running instance (what I'll call the production instance). There is a short period (I estimate 1-2 minutes) of server downtime while the AMI is generated.

2. From AMIs: Choose to launch an instance from the newly-created AMI. When going through the creation steps, I change the default selection for the AZ and choose to run the new instance in us-east-1b. I choose to keep the same Key Pair Name and Security Group as I have for the production instance.

After launching the instance, I have an EBS-backed Linux instance running in us-east-1d (production) and another running in us-east-1b (backup). The AZs have independent power and network connectivity. While the incident report describes how problems in one AZ can potentially impact others in the region, having this server running in another AZ gives me a backup resource and a way to bring my online database back up in the event of an outage.

3. Using the public DNS address, I test access to the Tomcat-based Naval Reactors History Database on the backup server - with success.

Note: I didn't allocate an Elastic IP address for this instance. First, it wouldn't make sense for my use - in the event of an outage, I would remap the Elastic IP address currently pointed at the production server to the backup server (see the command sketch below). Second, be aware that you are charged for an Elastic IP address that's allocated to your account but not in use.

4. Stop the EBS-backed backup server instance.

Result: production server running in AZ us-east-1d; backup server stopped, but ready to start and serve resources, in AZ us-east-1b.

I performed all of these steps successfully today.
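As a side note for my list-making: the same sequence could also be scripted with the EC2 API command line tools. This is just a sketch with placeholder IDs and names - I did all of this in the console, so these aren't commands I've actually run:

# step 1: create an AMI from the running production instance (placeholder instance ID)
ec2-create-image i-1234abcd -n "nrhdb-backup-2011-05-03"
# step 2: launch a backup instance from the new AMI in a different AZ,
# keeping the same key pair and security group as production
ec2-run-instances ami-0abc1234 -t m1.small -z us-east-1b -k my-keypair -g my-security-group
# step 4: stop the backup instance until it's needed
ec2-stop-instances i-5678efgh
# in an outage: remap the production Elastic IP address (placeholder) to the backup
ec2-associate-address 203.0.113.10 -i i-5678efgh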

---

This is one method of providing redundancy. But I want to come up with something a little more sophisticated - in part because I'm interested in moving to a new server OS and a more robust EC2 platform in the future. Here's a second scenario, in which I build a new instance to host the collection and attach a volume with the needed data to it - all in a different AZ than the one the production server runs in.

Steps 1-5 and 8-9 below are performed in the AWS Management Console - including step 5, which I'll comment on later.

1. From Volumes: Create a snapshot from the production server's EBS volume.

2. From Snapshots: Create a volume from the EBS snapshot. Again, since the production server is running in us-east-1d, I create the new volume in us-east-1b.

3. From AMIs: Find the right AMI for the future production server. In my case, I'm looking for a Linux OS AMI that I'm comfortable with, preferably with Apache Tomcat preloaded.

4. From AMIs: Launch an instance using the AMI found in step 3. I'll be mounting the volume created in step 2, so I will manually set the instance's AZ to us-east-1b.

5. From Volumes: Here, I will attach the volume created in step 2 to the instance launched in step 4.

6. In the new server's Linux OS, create the mount point location and mount the attached volume:

# create the mount point for the restored data volume
mkdir /mnt/prodata
# mount the EBS volume attached in step 5
mount /dev/sdf /mnt/prodata
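If the mount succeeds silently, a quick check confirms the volume is available where I expect:

df -h /mnt/prodata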

7. Copy the XTF and Naval Reactors History Database files from the just-mounted volume to the Tomcat location on the new production server (see the sketch after this list).

8. After testing, make this server the new production server and remap the Elastic IP address to it.

9. Stop the previous production server, and terminate it when I'm comfortable doing so.
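For step 7, the copy looks something like the following - the paths are hypothetical, since the actual XTF and Tomcat locations depend on the AMI and on where I've installed things:

# hypothetical paths: the restored volume holds the old root filesystem,
# so the application files live under /mnt/prodata at their original paths
cp -R /mnt/prodata/usr/share/tomcat/webapps/xtf /usr/share/tomcat/webapps/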

I had initially planned to perform step 5 using the EC2 command line tools, but it's vastly easier to use the AWS Management Console.
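For completeness, here's roughly what steps 1, 2, and 5 would look like with those tools. These are sketches with placeholder volume, snapshot, and instance IDs, not the exact commands I ran:

# step 1: snapshot the production server's EBS volume
ec2-create-snapshot vol-1234abcd
# step 2: create a new volume from that snapshot in a different AZ
ec2-create-volume --snapshot snap-5678efgh -z us-east-1b
# step 5: attach the new volume to the new instance as /dev/sdf
ec2-attach-volume vol-9876fedc -i i-abcd1234 -d /dev/sdf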

---

My conclusions: The second method provides an important foundation for ensuring the durability of my EC2-hosted online collection. Amazon's detailed report on last month's outage includes this statement: "For example, when running inside a Region, users have the ability to take EBS snapshots which can be restored in any Availability Zone...."

I'm still exploring how best to automate the process of creating snapshots and restoring a volume in an AZ different from the one the production server runs in. What I have as primary protection at this point, from the second procedure: my production service running in one AZ, and an EBS volume containing my application and online collection data restored and available for use in a second AZ.
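One direction I'm considering for the snapshot half - untested, just a sketch - is a nightly cron job that calls ec2-create-snapshot against the production volume. The volume ID, tool path, and log location below are placeholders:

# hypothetical crontab entry: snapshot the production volume nightly at 3 a.m.
# assumes the EC2 API tools are installed and the required environment
# variables (EC2_HOME, JAVA_HOME, credentials) are available to cron
0 3 * * * /opt/aws/bin/ec2-create-snapshot vol-1234abcd -d "nrhdb nightly" >> /var/log/nrhdb-snapshot.log 2>&1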

Also, I plan to do more reading on AWS best practices. I'm sure that I can improve upon the above procedures, but this is what I came up with based upon my current knowledge.