Sunday, February 23, 2020

Why consistent tooling matters

I have always known that there is a lot of tribal knowledge within the team that I am a part of. I also understood that siloed knowledge is undesirable, but until recently I did not really see the extent to which it can be problematic. We become too comfortable with inefficient workflows because they are what we are used to. Tooling could have been built to make the workflows more efficient, but in the moment the time needed to build the tools never seemed worth paying compared to pushing through the sub-optimal workflows. It was much easier to train people on the nuances of how the systems worked and then demonstrate the sub-optimal way to complete the task. However, this approach did not scale and fell apart once it became necessary to aggressively on-board new engineers.

Once it became a requirement to on-board product support engineers to the product that I was helping maintain, I was essentially starting from the ground up with a bunch of new hires for a geographically dispersed team. The existing product support team members are tech savvy, but the technology that I had been supporting works very differently from anything that they had worked with previously. As I was doing some of the training and explaining how to troubleshoot cases, I would often get the question "How do you know to look at x system when y problem happens?". These questions really bothered me because the answers were largely undocumented product quirks that the existing Support team and I had stumbled upon and remembered only because of the pain of troubleshooting the low-level mechanics behind an issue.

It is important for all team members to use the same methodology when troubleshooting similar cases. The lack of tooling created a prerequisite: engineers had to understand the low-level details of the systems before they could even begin to understand the approach to solving technical issues. Without consistent tooling there was a wide variety of processes and methodologies for solving cases. Everyone was left to their own devices to figure out what they saw as the best way to solve cases. Useful tooling does a few things:
  1. Provides mechanisms for troubleshooting specific issues
  2. Provides a shared language for how issues have been troubleshot and what troubleshooting to do next
  3. Reduces on-boarding time
I will go through each of the 3 items above since each requires its own more in-depth explanation.

Provides mechanisms for troubleshooting specific issues

PKI is a very large topic with many complex pieces. Checking the validity of a certificate is a common task for Support Engineers. My preferred way of doing it is to use the tooling in OpenSSL, wrapped in a shell function. Below is the function for reference:

function checkcert {
    # Usage: checkcert <hostname>   (connects to port 443)
    echo "---------------------------"
    echo "Certificate Valid Dates"
    echo "---------------------------"
    true | openssl s_client -connect "$1:443" 2>/dev/null | openssl x509 -noout -dates
    echo "---------------------------"
    echo "Certificate CN and DNS Info"
    echo "---------------------------"
    true | openssl s_client -connect "$1:443" 2>/dev/null | openssl x509 -noout -text | grep DNS:
    echo "---------------------------"
    echo "Issuer"
    echo "---------------------------"
    true | openssl s_client -connect "$1:443" 2>/dev/null | openssl x509 -noout -issuer
    echo "---------------------------"
    echo "Check Hostname Validity"
    echo "---------------------------"
    # 'Verify return code' is s_client's summary of the certificate verification
    true | openssl s_client -connect "$1:443" 2>/dev/null | grep 'Verify return code'
}

If a support engineer has to validate a certificate, it is as easy as running checkcert with the hostname from the command line. I am assuming that SNI is not a factor for this example. There is no need to open a browser and grab a screenshot of text. The output of the command already provides the most relevant information for judging the validity of a certificate, in a format that is easy to consume and share. Having information formatted in a way that is easy to consume is critical to the next point.
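For teams that would rather have the same check available from a script, here is a minimal Python sketch of the idea using only the standard library. It is not a port of checkcert: the function names `fetch_cert` and `summarize_cert` are my own, and unlike the shell function it does send SNI and raises an exception when verification fails instead of printing a return code.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_cert(host, port=443, timeout=5.0):
    """Return the peer certificate as a dict (hypothetical helper).

    Uses the default context, so chain and hostname verification are on;
    an invalid certificate raises ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def summarize_cert(cert, now=None):
    """Pull out the same facts checkcert prints: dates, DNS names, issuer."""
    now = now or datetime.now(timezone.utc)
    not_before = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notBefore"]), timezone.utc)
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), timezone.utc)
    dns_names = [value for kind, value in cert.get("subjectAltName", ())
                 if kind == "DNS"]
    # The issuer is a tuple of RDNs, each a tuple of (key, value) pairs.
    issuer = {key: value for rdn in cert.get("issuer", ()) for key, value in rdn}
    return {
        "not_before": not_before,
        "not_after": not_after,
        "dns_names": dns_names,
        "issuer": issuer,
        "currently_valid": not_before <= now <= not_after,
        "days_remaining": (not_after - now).days,
    }

# Example (needs network access):
#   print(summarize_cert(fetch_cert("example.com")))
```

Returning a dict rather than printing keeps the output easy to paste into a ticket or feed into other tooling.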

Provides a shared language

IT systems are complex and the roles between organizations are highly specialized. Having a clear and consistent way of communicating between people results in issues being fixed more quickly. Using the same tooling builds trust and confidence when handing work items from one individual to another in a team. There is also less need to explain the steps that have been taken so far to troubleshoot a behavior when the methodology that is followed is consistent across the team. Just like a real language, the shared and consistent follow-through can feed into how well interactions go with external teams as well. For example, if logs are expected to be formatted a certain way when requests are sent to the Product team, then tooling can be built to accommodate this need. Building out tooling to format logs in a more consumable way reduces the amount of time spent formatting manually and also reduces the likelihood that the incorrect format will be used. This is especially important if using the incorrect format causes delays through unnecessary back and forth on the tickets.

Reduces on-boarding time

The previous points really feed into this last item, which I believe is the most important. Turnover is a part of any company and no one should attempt to completely eliminate it. Instead, focus on creating a great on-boarding program and reducing the amount of time necessary for someone to become proficient at their key responsibilities. For example, if I told an entry-level new hire "OpenSSL is the only acceptable way to do certificate validation and if you have questions then check out the 'man' page", then I would be failing as someone responsible for on-boarding. This is because the goal is to do certificate validation, not to become an expert at OpenSSL (which is still a good skill to have). We have to think about how to abstract the low-level details where it makes sense so that more time and energy may be spent on higher-level tasks that add value to the company and customers.

I hope you enjoyed the post. Please leave questions and feedback in the comments below.

Saturday, September 15, 2018

AWS ALB Failover with Lambda

Abstract

The purpose of this article is to go over the mechanisms involved in automatically switching an ALB to an alternative target group in the event that the ALB's existing target group becomes unhealthy.

ALB failover with Lambda

Lambda is amazing for so many reasons. There are a few things that you must have in place to perform failover with an ALB.
  1. Primary Target Group
  2. Alternative Target Group
And that is about it. Everything is fairly straightforward, so let's start. I have an ALB set up with the following rule.

[Screenshot of the ALB rule]

For the magic of automatic failover to occur, we need a Lambda function to swap out the target groups and we need something to trigger the Lambda function. The trigger is going to be a CloudWatch alarm that sends a notification to SNS. When the SNS notification is triggered, the Lambda function will run.

First, you will need an SNS topic. My screenshot already has the Lambda function bound, but this will happen automatically as you go through the Lambda function set up.

[Screenshot of the SNS notification]

Second, create a CloudWatch alarm like the one below. Make sure to select the topic configured previously. The CloudWatch alarm will trigger when there is less than 1 healthy host.

[Screenshot of the CloudWatch alarm]

Third, we finally get to configure the Lambda function. You must ensure that your Lambda function has sufficient permissions to make updates to the ALB. Below is the JSON for an IAM policy that, once attached to the function's role, will allow the Lambda function to make updates to any ELB.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "elasticloadbalancing:*",
            "Resource": "*"
        }
    ]
}

The code below is intended to be a template and is not the exact working copy. You will have to update the snippet below with the information needed to work on your site.

from __future__ import print_function

import boto3

print('Loading function')
client = boto3.client('elbv2')

def lambda_handler(event, context):
    try:
        # This is the HTTP (port 80) listener
        response_80 = client.modify_listener(
            ListenerArn='arn:aws:elasticloadbalancing:region:id:listener/app/alb/id/id',
            DefaultActions=[{
                'Type': 'forward',
                'TargetGroupArn': 'arn:aws:elasticloadbalancing:region:id:targetgroup/id/id'
            }]
        )
        # This is the HTTPS (port 443) listener; use that listener's own ARN here
        response_443 = client.modify_listener(
            ListenerArn='arn:aws:elasticloadbalancing:region:id:listener/app/alb/id/id',
            DefaultActions=[{
                'Type': 'forward',
                'TargetGroupArn': 'arn:aws:elasticloadbalancing:region:id:targetgroup/id/id'
            }]
        )
        print(response_80)
        print(response_443)
    except Exception as error:
        # Print the failure so that it ends up in the function's CloudWatch logs
        print(error)

[Screenshot of the Lambda function settings]
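The template above ignores the event that SNS hands to the function. As a hedged sketch, the alarm state can be read from that event so that one function could switch to the alternative target group on ALARM and back to the primary on OK. This goes beyond what I described: it assumes the alarm also notifies the topic when it returns to OK, and the helper names and placeholder ARNs are made up for illustration.

```python
import json


def alarm_state_from_sns_event(event):
    """Extract the CloudWatch alarm state from an SNS-triggered Lambda event.

    The alarm notification arrives as a JSON string in the SNS message body
    and carries the new state in its NewStateValue field.
    """
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["NewStateValue"]


def choose_target_group(state, primary_arn, alternative_arn):
    """Pick the target group to forward to for a given alarm state."""
    # Only an explicit ALARM fails over; OK (or anything else) uses the primary.
    return alternative_arn if state == "ALARM" else primary_arn

# Inside lambda_handler the result would feed the modify_listener calls:
#   state = alarm_state_from_sns_event(event)
#   target = choose_target_group(state, PRIMARY_ARN, ALTERNATIVE_ARN)
```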

After putting it all together, when there is less than 1 healthy target group member associated with the ALB, the alarm is triggered and the default target group is replaced with the alternate backup target group. I hope this helps!
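If you would rather script the alarm than click through the console, something along these lines should work with boto3. Treat it as a sketch: the statistic, period and evaluation settings are my assumptions rather than the values from my screenshots, and the function names are hypothetical. The dimension values are the trailing part of the load balancer and target group ARNs.

```python
def healthy_host_alarm_params(alarm_name, load_balancer_dim,
                              target_group_dim, sns_topic_arn):
    """Build put_metric_alarm arguments for a 'less than 1 healthy host' alarm.

    load_balancer_dim looks like 'app/<name>/<id>' and target_group_dim
    looks like 'targetgroup/<name>/<id>'.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer_dim},
            {"Name": "TargetGroup", "Value": target_group_dim},
        ],
        "Statistic": "Minimum",          # assumption
        "Period": 60,                    # assumption: one-minute periods
        "EvaluationPeriods": 1,          # assumption: alarm on first breach
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm; needs AWS credentials with CloudWatch permissions."""
    import boto3  # imported here so the builder above works without boto3
    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dict makes them easy to review before anything is created in the account.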


Troubleshooting TCP Resets (RSTs)

Inconsistent issues are by far the most difficult to track down. Network inconsistencies are particularly problematic because there can often be many different devices that must be looked into in order to identify the root cause. The following troubleshooting goes through a couple of steps. The first part is to start a tcpdump process that will record TCP RSTs. Then you can send a lot of HTTP requests. Below is the command to start the tcpdump; the trailing & sends the process to the background. However, the output will still be sent to the active terminal session because it is not redirected anywhere else.

sudo tcpdump -i any -n -c 9999 -v 'tcp[tcpflags] & (tcp-rst) != 0 and host' &

Below is the command to issue lots of HTTP requests. The important part to understand about the command is that every curl invocation goes through the full TCP build-up and tear-down that happens during the HTTP request process.

for i in {1..10000}; do curl -ks > /dev/null ; done

Below is an example of what a potential output could be.

17:16:56.916510 IP (tos 0x0, ttl 62, id 53247, offset 0, flags [none], proto TCP (6), length 40) > Flags [R], cksum 0x56b8 (correct), seq 3221469453, win 4425, length 0
17:17:19.683782 IP (tos 0x0, ttl 252, id 59425, offset 0, flags [DF], proto TCP (6), length 101) > Flags [R.], cksum 0x564b (correct), seq 3221469453:3221469514, ack 424160941, win 0, length 61 [RST+ BIG-IP: [0x2409a71:704] Flow e]
17:18:54.484701 IP (tos 0x0, ttl 62, id 53247, offset 0, flags [none], proto TCP (6), length 40) > Flags [R], cksum 0x46f7 (correct), seq 4198665759, win 4425, length 0

While it may be unclear exactly why the TCP RSTs are happening, this does provide a mechanism to reproduce the TCP RST behavior so that it can be investigated on the other devices in the network traffic flow. Below is documentation on how to troubleshoot TCP RSTs for the F5.
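When the capture runs long, counting the resets by hand gets old quickly. Here is a small Python sketch that summarizes saved tcpdump output like the example above. The name `summarize_resets` is hypothetical, and it assumes the verbose one-line format shown in this post.

```python
import re
from collections import Counter

# Matches the TCP flags field, e.g. "Flags [R]" or "Flags [R.]".
# The IP header's lowercase "flags [DF]" is deliberately not matched.
FLAGS_RE = re.compile(r"Flags \[([^\]]+)\]")


def summarize_resets(lines):
    """Count reset packets per flag combination and collect RST annotations.

    Returns (counter, annotations), where annotations holds trailing
    '[RST+ ...]' notes such as the cause text a BIG-IP can attach.
    """
    by_flags = Counter()
    annotations = []
    for line in lines:
        match = FLAGS_RE.search(line)
        if not match or "R" not in match.group(1):
            continue  # not a TCP reset
        by_flags[match.group(1)] += 1
        if "[RST+" in line:
            annotations.append(line[line.index("[RST+"):].strip())
    return by_flags, annotations

# Example: by_flags, notes = summarize_resets(open("capture.txt"))
```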

 Happy troubleshooting!

Grabbing AWS CloudFront IPs with curl and jq

There are times when you want to restrict access to your infrastructure behind CloudFront so that requests must go through the CloudFront CDN instead of hitting your origin directly. Fortunately, AWS lists their public IP ranges in JSON format at the following link. However, there are a lot of services in that file, and it would be very tedious to read through the entire JSON to grab the specific CloudFront IPs. Using the command line tools curl and jq together, we can easily grab just the CloudFront IP ranges to lock down whatever origin exists. Below is the command that I've used to grab just the CloudFront IPs. Enjoy!

curl | jq '.prefixes | .[] | select(.service == "CLOUDFRONT") | .ip_prefix'
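If jq is not available, the same filter is only a few lines of Python. This is a sketch rather than what I actually ran: the function names are hypothetical, and the URL in the code is my assumption of where AWS publishes the file, so double check it before relying on it.

```python
import json
from urllib.request import urlopen

# Assumption: the published location of the AWS IP ranges file.
IP_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"


def cloudfront_prefixes(ip_ranges):
    """Return the IPv4 prefixes whose service is CLOUDFRONT.

    Mirrors the jq filter:
    .prefixes | .[] | select(.service == "CLOUDFRONT") | .ip_prefix
    """
    return [entry["ip_prefix"]
            for entry in ip_ranges.get("prefixes", [])
            if entry.get("service") == "CLOUDFRONT"]


def fetch_ip_ranges(url=IP_RANGES_URL, timeout=10):
    """Download and parse the IP ranges document (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

# Example:
#   for prefix in cloudfront_prefixes(fetch_ip_ranges()):
#       print(prefix)
```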