While working on a project that used an Amazon Linux NAT (network address translation) instance for outbound connections from redundant Availability Zones (AZs), I realized
that a single NAT instance handling egress requests for two AZs introduces
a single point of failure.
A whitepaper written by Jinesh Varia outlines the steps required to
implement a two-way monitoring high-availability (HA) failover NAT solution. He
provides a script, along with a guide on which variables to replace so that
the two NAT instances can see each other.
However, I want to share an issue I ran into
while testing this configuration. I found that when I stopped a NAT
instance, it would not restart. My next step was to check whether the intended
routing failover was occurring, and it was. Using the command below, I was
able to view the nat_monitor.sh log:
tail /tmp/nat_monitor.log
Troubleshooting showed that the active NAT
instance was unable to see the downed instance's state, for two reasons.
The first is that the instances only had auto-assigned public IPs, not Elastic
IPs (EIPs). An auto-assigned public IP is released when the instance stops, whereas
an EIP stays associated with the instance while it is turned off and remains
visible to the API, so calls that refer to a box that is turned off can still find it.
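As a quick check before testing failover, something like the following could confirm that each NAT instance has an EIP associated. This is only a sketch: it assumes the newer AWS CLI (`aws`) is installed and configured, which is not what the whitepaper's script uses, and `has_eip` is a hypothetical helper name.

```shell
# Sketch: report whether an instance has an Elastic IP associated with it.
# Assumption: the AWS CLI ("aws") is installed and has credentials configured.
# has_eip is a hypothetical helper, not part of nat_monitor.sh.

# Usage: has_eip INSTANCE_ID
# Exit status: 0 if an EIP is associated, 1 if not, 2 if the lookup failed.
has_eip() {
  local instance_id
  local addresses
  instance_id="$1"
  addresses=$(aws ec2 describe-addresses \
    --filters "Name=instance-id,Values=${instance_id}" \
    --query 'Addresses[].PublicIp' \
    --output text) || return 2
  # Empty output means no Elastic IP is associated with this instance.
  [ -n "$addresses" ]
}
```

For example, `has_eip "$NAT_ID" || echo "no EIP on $NAT_ID"` could be run once for each NAT instance.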
The second is covered in a note Jinesh made in his
whitepaper (in Step 7). He makes it clear that the script works with tools
version 1.6.12.2 2013-10-15. He points
out that if NAT_STATE isn't updating, you should change "print $4;" on
line 77 to "print $5;", because different versions of the tools
format the ec2-describe-instances output differently. Here's the original line 77:
NAT_STATE=`/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL | grep INSTANCE | awk '{print $4;}'`
This article was written January 30, 2014, and the tools
have since been upgraded. When I opened a ticket with AWS for help
troubleshooting the script's output, the support engineer recommended changing
"print $5;" to "print $6;", and that change produced the outcome I'd been seeking.
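Because the column holding the state moves between tool versions, one way to avoid editing the column number again is to look for the state by name instead of by position. The sketch below is my own variation, not part of the whitepaper's script: `extract_state` is a hypothetical helper that scans the INSTANCE line for one of the EC2 instance state names.

```shell
# Sketch: find the instance state by name rather than by column number.
# extract_state is a hypothetical helper, not part of nat_monitor.sh.
# It reads ec2-describe-instances-style output on stdin and prints the first
# field on an INSTANCE line that matches a known EC2 instance state.
extract_state() {
  awk '/INSTANCE/ {
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^(pending|running|shutting-down|terminated|stopping|stopped)$/) {
        print $i
        exit
      }
    }
  }'
}
```

With this in place, line 77 could be written as `NAT_STATE=$(/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL | extract_state)`. It prints nothing if no state name is found, so an empty NAT_STATE would still signal a parsing problem.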
The script uses the API to see if the NAT box is
"stopped". If it is, then it will start it. If it's not stopped, it
will try to stop it and then loop back to the previous attempt to start it.
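The stop-then-start logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the whitepaper's code: the helper names (`get_nat_state`, `stop_nat`, `start_nat`, `recover_nat`), the retry limit, and the `STATE_COLUMN` and `WAIT_SECONDS` variables are all hypothetical, and the state column depends on your tools version as noted earlier.

```shell
# Sketch of the recovery loop: start the NAT if it is stopped, otherwise
# try to stop it and check again. All helper names here are hypothetical.

# Print the state of the given instance. STATE_COLUMN is an assumption;
# the right column depends on the version of the EC2 API tools.
get_nat_state() {
  /opt/aws/bin/ec2-describe-instances "$1" -U "$EC2_URL" \
    | grep INSTANCE | awk -v col="${STATE_COLUMN:-6}" '{print $col;}'
}

stop_nat()  { /opt/aws/bin/ec2-stop-instances "$1" -U "$EC2_URL"; }
start_nat() { /opt/aws/bin/ec2-start-instances "$1" -U "$EC2_URL"; }

# Usage: recover_nat INSTANCE_ID [MAX_TRIES]
# Returns 0 once a start has been issued, 1 if the instance never
# reached "stopped" within MAX_TRIES checks (the limit is an assumption).
recover_nat() {
  local nat_id
  local max_tries
  local tries
  local state
  nat_id="$1"
  max_tries="${2:-10}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    state=$(get_nat_state "$nat_id")
    if [ "$state" = "stopped" ]; then
      start_nat "$nat_id"
      return 0
    fi
    # Not stopped yet: ask it to stop, wait, then loop back and re-check.
    stop_nat "$nat_id"
    sleep "${WAIT_SECONDS:-5}"
    tries=$((tries + 1))
  done
  return 1
}
```

The retry cap is there only so the sketch cannot loop forever if the state never parses as "stopped", which is exactly the symptom a wrong awk column produces.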
You can test this functionality by
stopping an instance in the console and observing it restart automatically
after the threshold in the nat_monitor.sh configuration is met.
Rory Vaughan is a Cloud Engineer with JHC Technology. He can
be reached at rvaughan (at) jhctechnology.com, or you can connect
with him on LinkedIn.