How to encrypt and use a .env file for a service whose configuration is hosted on S3.
Apps are configured by putting files in a secure configuration bucket on S3. The ENTRYPOINT script for our apps pulls all files in from the app’s path in the bucket before starting up. This allows an app to be securely configured with a .env file and, for example, server.crt and server.key files for TLS connections.
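The exact ENTRYPOINT script isn’t reproduced here, but a minimal sketch of the idea might look like the following; the $CONFIG_S3_BUCKET and $APP_NAME variables are placeholders, not the names the images actually use.

```
#!/bin/sh
# Hypothetical sketch of the config-pulling ENTRYPOINT.
# Pull every file from the app's path in the configuration bucket into the
# working directory, then hand off to the app's normal start command.
aws s3 sync "s3://${CONFIG_S3_BUCKET}/${APP_NAME}/" .

exec "$@"
```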
You will need:
- An AWS CLI login (this is different from the AWS account you can log into via the web)
- kms:Encrypt permissions
- The AWS Command Line Interface package installed and configured
Though .env files are stored encrypted in S3 and only transferred securely, we still encrypt environment variables like passwords so that they are not visible in plaintext when editing the .env files.
Each service is configured to have its own private key in the Amazon KMS keystore. Only the task role may decrypt with that key.
Adding the _KMS_ENCRYPTED suffix to an environment variable’s name in the .env file will cause the task to decrypt the variable at runtime, storing it in process.env after stripping the suffix.
To create an encrypted environment variable value:
Visit the “Customer managed keys” section of the KMS part of the AWS web console.
Look for the Key ID for your service. Save it in the $CONFIG_KEY_ID environment variable.
Log in to AWS CLI with an account that has kms:Encrypt permissions for the key.
Run the following command with a leading space so that it doesn’t appear in your command history: aws kms encrypt --output text --key-id $CONFIG_KEY_ID --plaintext STR-TO-BE-ENCRYPTED
Copy the encrypted value (the output up to the whitespace) as the value in .env: PASSWORD_KMS_ENCRYPTED=…
Note that using --plaintext on the command line will cause aws kms to encrypt the ASCII as-is. When using the fileb:// form to reference a file on disk, aws kms will first Base64-encode the value, which causes a failure on the app side, since the app does not expect Base64-encoded values.
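As a worked example, the whole flow looks roughly like this; the key ID and the string being encrypted are placeholders.

```
# Placeholder key ID copied from the KMS console's "Customer managed keys" page.
export CONFIG_KEY_ID=1234abcd-12ab-34cd-56ef-1234567890ab

# The leading space keeps the secret out of shell history (assuming the shell
# is configured to ignore space-prefixed commands).
 aws kms encrypt --output text --key-id $CONFIG_KEY_ID --plaintext 'str-to-be-encrypted'

# Copy the first whitespace-separated field of the output into .env:
# PASSWORD_KMS_ENCRYPTED=AQICAHj…
```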
Decrypting a variable that was encrypted using the method above can be done from a terminal session with commands along the lines of the sketch below. If you have multiple profiles on your computer, you may add this option to the aws kms decrypt command: --profile=myprofile
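This is a sketch rather than the exact commands from the original note; it assumes the ciphertext is the base64 string stored as the variable’s value in the .env file.

```
# Placeholder: the value stored after PASSWORD_KMS_ENCRYPTED= in .env.
CIPHERTEXT='AQICAHj…'

# KMS wants the raw binary ciphertext, so decode the base64 first; the CLI
# then returns the plaintext base64-encoded, so decode that as well.
echo "$CIPHERTEXT" | base64 --decode > /tmp/ciphertext.bin
aws kms decrypt --ciphertext-blob fileb:///tmp/ciphertext.bin \
    --output text --query Plaintext | base64 --decode
rm /tmp/ciphertext.bin
```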
See the README.md in CityOfBoston/digital-terraform.
Introduction to how the Digital webapps infrastructure is set up in AWS
The Digital team uses Docker via Amazon’s Elastic Container Service to deploy its webapps. We migrated to AWS from Heroku primarily so we could establish a VPN connection to internal City databases (such as those used for boards and commissions applications and the Registry certificate ordering).
Our production cluster runs two copies of each app, one in each of two availability zones (AZs). This is more for resilience against AZ-specific failures than for sharing load.
Almost all of our AWS infrastructure is described by and modified using our Terraform configuration.
The webapps that the City has developed so far are extremely small and low-traffic. Docker containers let us pack a few machines with as many webapps as we can; right now we’re limited only by memory. Docker keeps these apps isolated from each other. It also makes it easy to do rolling, zero-downtime deployments of new versions.
The typical limitations of Docker (persistent storage is a pain, as is running related processes together, and there’s some loss of efficiency under high load) are not concerns for the types of apps we’re building.
Amazon’s ECS, along with its Application Load Balancers, handles restarting crashed jobs and routing traffic to the containers.
Our app containers are run on EC2 instances that live in four private subnets (2 AZs × 2 environments). These instances do not have public IPs and therefore cannot communicate directly with the public internet, which gives us some level of safety through isolation.
These ECS cluster instances receive traffic from Amazon’s ALB load balancers, which live in corresponding public subnets. They can contact public web services through NAT gateways, which also live in the public subnets. The ECS cluster instances also have access to internal City datacenters through the VPN gateway.
The instances are further isolated by having security groups that only allow traffic from the security groups of their corresponding ALBs (and SSH traffic from the bastion instance).
The VPN gateway connects from our VPC to the City datacenter. It has two connections running simultaneously for redundancy. AWS VPNs need to have regular traffic to keep them active, and if they do disconnect they need traffic from outside AWS to cause them to come back online.
We have a SiteScope rule set up with the CoB network team that pings an EC2 instance inside of our VPC. (Currently this EC2 instance does not seem to be created via Terraform.) This rule does a ping every few minutes, which keeps traffic running on the connection and also will bring it back up if it does go down.
Additionally, we have a CloudWatch alarm that fires if one or both of the VPN connections goes down. If one has gone down, traffic should still be flowing over the other, and usually it will come back up of its own accord. Contact NOC if there are issues.
In general, you should not need to SSH on to the cluster instances. Definitely not for routine maintenance (do that through an ECS task if you need that kind of thing). It may be necessary to troubleshoot and debug issues, however.
Instructions for how to SSH on to our bastion machine using an SSH key loaded into your IAM account, and from there how to SSH on to a cluster instance, are in the digital-terraform’s README.md file.
To access the AWS resources (e.g. EC2 instances) you first need to SSH into the AWS environment.
You can access the SSH bastion from the City Hall network (140.241.0.0/16) if you have an SSH key on your AWS account and are in the SshAccess IAM group.
Request an AWS Admin to add you to the SshAccess IAM group.
From the IAM console, upload a public key for your account
Edit your /etc/hosts to add the following line: 35.169.164.239 apps-bastion
Initialize your account on the bastion by SSHing without a public key: ssh -o PubkeyAuthentication=no <username>@apps-bastion
Note: your bastion username is the bit before @boston.gov in your account name.
Control-C out when it asks for a password.
SSH in with your public key: ssh -A <username>@apps-bastion (the -A forwards the SSH agent, which is important for SSH'ing on to the instances).
From the Bastion, you can get to the EC2 instances which host the ECS services.
Request that the AWS Admin share the ec2-user private keys and passwords with you via Dashlane. There are 2 keys, one for production and one for staging. Save whichever you need, or both, into your ~/.ssh folder.
Ensure the permissions on the private key file(s) are set to 600 (chmod 600 xxxx).
Note the Private IPv4 address of the EC2 instance from the EC2 instances page in the AWS console - this will be 10.40.15.x for staging and 10.40.115.x for production.
There are 2 production instances; you can use either.
These IP addresses change after each deployment, so check regularly.
Once you have successfully SSH'd onto the bastion (#6 in Step 1 above), you will be able to SSH onto the instance: ssh ec2-user@<ipaddress>
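If you’d rather look the addresses up from the CLI than the console, something along these lines works; the Name tag filter is an assumption about how the cluster instances are tagged, so adjust it to match reality.

```
# List the private IPs of running cluster instances (the tag value is a guess).
aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
              "Name=tag:Name,Values=AppsStaging*" \
    --query "Reservations[].Instances[].PrivateIpAddress" \
    --output text

# Then, from the bastion:
ssh ec2-user@10.40.15.x
```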
Once you’re on a container instance (#4 step 2 above), you can use docker commands to inspect containers. For example, some useful commands are:
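The specific commands weren’t captured in this note; a few standard ones that tend to be useful (container names are placeholders):

```
# List running containers and their names/IDs.
docker ps

# Tail the logs of a particular container.
docker logs --tail 100 -f <container-name>

# Open a shell inside a running container.
docker exec -it <container-name> /bin/sh

# Show live CPU/memory usage per container.
docker stats
```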
Outside of the containers, the ec2-user account can use sudo -s to open up a shell with root access.
The Digital team uses Terraform to manage the AWS configuration.
Terraform is a CLI utility that synchronizes AWS with scripts: in essence, it uses a series of scripts to detect and make changes to AWS. Terraform commands are run from a terminal session on a machine with Terraform installed. See the Terraform website and documentation.
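For orientation, the typical loop looks like this when run from the folder holding the configuration; the detailed procedures below spell out when to run each step.

```
# Download the providers and modules the configuration references.
terraform init

# Show what would change in AWS without touching anything.
terraform plan

# Make the changes.
terraform apply
```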
How to restart an ECS service when you change its configuration.
When you update a service’s configuration in S3 you’ll need to manually restart it to pick up the file changes. Because we do rolling ECS updates, you can do this without dropping traffic.
You will need to have an AWS Console account.
First, visit the ECS page on AWS and choose your cluster (AppsStaging or AppsProd).
Then, click the checkbox next to the service you want to restart and press the Update button.
Don’t touch any other settings, but make sure to click the Force new deployment checkbox. That will start up new containers, even though the code hasn’t changed from what’s currently running.
Click Next step through all of the screens, and then click Update service.
Navigate to the service’s “Events” tab and keep an eye on things. You should see it start new tasks and eventually deregister and stop the old tasks. Once it says “…has reached a steady state” again then you know things were successful.
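If you prefer the command line over the console, the same force deployment can be triggered with the AWS CLI; the service name here is a placeholder.

```
# Start a new rolling deployment without changing the task definition.
aws ecs update-service \
    --cluster AppsStaging \
    --service <service-name> \
    --force-new-deployment

# Check the rollout status of the service's deployments.
aws ecs describe-services \
    --cluster AppsStaging \
    --services <service-name> \
    --query "services[0].deployments"
```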
Guide to mounting an S3 bucket via SFTP as a drive on your computer
PREP
Check/Create SSH/RSA Keys
The RSA keys should not have passphrases; create new keys (without a passphrase) if the user’s current keys were set up with one.
Setup SFTP Account on AWS
If you’re not an Admin, ask one (David, Phill) to create your account
Add the user’s SSH/RSA key to their FTP account
Make sure the user you use in your computer is an admin on that computer
We’ll need to run commands under `sudo`
SETUP
Download FUSE & SSHFS from https://osxfuse.github.io/
Install FUSE
Install SSHFS
Restart the computer
Open the Terminal app, which can be found in the Applications folder under Utilities
Check sshfs is installed with this command: ```sshfs --help```
Create two directories, `~/mnt` and `~/mnt/patterns`:
mkdir ~/mnt
mkdir ~/mnt/patterns
Locate the SSH/RSA keys. These are probably at `~/.ssh/`.
Save/copy the user’s FTP account username (ask an AWS admin if you don’t have it)
Try connecting/mounting the drive with a command along the lines of the sketch below, replacing the RSA key and username with the values from the previous two steps.
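A sketch of what that command might look like, assuming the private key lives at $HOME/.ssh/id_rsa and using the same host as the sftp fallback below:

```
# Hypothetical mount command; adjust the key path and username.
sshfs -o IdentityFile=$HOME/.ssh/id_rsa,reconnect \
    username@assets_sftp.boston.gov: ~/mnt/patterns
```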
If this doesn’t work try ```sftp -o IdentityFile=RSAPublicKeyLocation username@assets_sftp.boston.gov```
This should work; if not, troubleshoot by looking at the logs from the previous command (#1)
Now that you are able to mount, it’s time to create a Bash script that will run when the user logs in.
Using a text or code editor, create a bash file at ```/Library/Startup.sh```
Copy the code below into the file
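The original script isn’t reproduced here; the following is a minimal sketch that reuses the mount command from the test above, with a placeholder key path and username.

```
#!/bin/bash
# Hypothetical login-time mount script. Adjust the key path, username, and
# mount point to match your setup.
MOUNT_POINT="$HOME/mnt/patterns"

# Make sure the mount point exists, then mount the SFTP drive over it.
mkdir -p "$MOUNT_POINT"
sshfs -o IdentityFile=$HOME/.ssh/id_rsa,reconnect \
    username@assets_sftp.boston.gov: "$MOUNT_POINT"
```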
From the Terminal app, make the file executable: ```chmod +x /Library/Startup.sh```
Get to this file using the `Finder`, then right-click on the file and select the 'Get Info' option. Use 'Open with:' to set it to open with 'Terminal' (found under 'Applications > Utilities'; set the 'Enable' drop-down to 'All Applications' to see it).
Open up “System Preferences” and go to “Users & Groups”
Switch to the “Login Items” tab, unlock the ability to edit these settings by clicking the Padlock in the bottom left.
Use the “+” button to add a new action in “Login Items”, this will open up a file browser window.
Use the File Browser to locate the “Startup.sh” file we created in the “Library” and select it.
Use the Apple icon on the top left of the screen to “Log Out”
When you sign in again, open up a “Finder” window and check whether the drive mounted at ~/mnt/patterns
Debug Tips
How to update the AMI on our ECS cluster instances
The Digital webapps cluster uses the Elastic Container Service on AWS. We have a handful of EC2 instances that actually host the containers.
These instances use a stock Amazon Machine Image (AMI) from Amazon designed for Docker that comes with the ECS agent pre-installed. From time to time, Amazon releases a new version of this “ECS-optimized” image, either to upgrade the ECS agent or the underlying OS.
Thanks to our instance-drain Lambda function, updating the cluster EC2 images is a zero-downtime process. Nevertheless, it’s best to run this during the weekly digital maintenance window, and make sure that staging looks good before doing it on production.
This process is sometimes referred to as “rolling” the cluster, though it’s more accurate to say that we set up a second cluster of machines and migrate to it.
Find the latest ID for the ECS-optimized AMI. You can do this on the Amazon ECS-optimized AMIs page.
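Alternatively, the latest recommended image ID is published as a public SSM parameter; a command along these lines should return it (the path below assumes the Amazon Linux 2 variant of the ECS-optimized AMI).

```
# Fetch the AMI ID of the latest recommended ECS-optimized Amazon Linux 2 image.
aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
    --query "Parameters[0].Value" \
    --output text
```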
In a browser, navigate to the CityOfBoston/digital-terraform repository and edit the apps/clusters.tf file.
Update the instance_image_id value for the staging_cluster module to the new AMI ID from step 1 above. Save/commit the file as a new branch, not directly to the production branch.
Make a PR which merges the new branch into the production branch, and assign a person to review the changes.
When you make the PR, GitHub will automatically execute an atlantis plan process (see what Atlantis is).
When the plan is done, inspect the output and expect to see changes to:
- resource "aws_autoscaling_group" "instances"
- resource "aws_cloudwatch_metric_alarm" "low_instance_alarm"
- resource "aws_launch_configuration" "instances" (This last one will have the new AMI guid)
Any other changes the plan identifies should be carefully investigated.
Terraform may be proposing to make changes to the AWS environment you don't want, or at least are not expecting.
After viewing the plan, if you need to update the terraform scripts, be sure to save the changes to the new branch.
If committing your changes does not trigger the atlantis plan automatically, you can run it manually by creating a new comment with atlantis plan.
Once the atlantis plan is finished and the PR has been approved, create a new comment: atlantis apply.
This will cause Atlantis to apply changes to AWS. (Atlantis runs a terraform apply command in a background process.) See what happens.
Keep an eye on the “ECS Instances” tab in the cluster’s UI. You should see the “Running tasks” on the draining instance(s) go down, and go up on the new instances.
Once all the tasks have moved, the old instance(s) will terminate and Terraform will complete. Check a few URLs on staging to make sure that everything’s up-and-running.
Now that Atlantis’s apply has finished, you can merge the staging PR and repeat the process (steps 2-6) for the production cluster.
If you have terraform installed on your local computer, you can do the update directly from your computer.
Find the latest ID for the ECS-optimized AMI. You can do this on the Amazon ECS-optimized AMIs page.
Ensure your cloned copy of the digital-terraform repository is on the production branch, and that the branch is up to date with the origin on GitHub.
Create a new branch from the production branch.
In your preferred IDE, open the /apps/clusters.tf file and update the instance_image_id value for the staging_cluster module to the new AMI ID from step 1 above. Save/commit the file to the new branch (not directly to the production branch).
In a terminal/shell, from the apps/ folder of the repo, run the command:
terraform plan
When the plan is done, inspect the output and expect to see changes to:
- resource "aws_autoscaling_group" "instances"
- resource "aws_cloudwatch_metric_alarm" "low_instance_alarm"
- resource "aws_launch_configuration" "instances" (This last one will have the new AMI guid)
Any other changes the plan identifies should be carefully investigated. Terraform may be proposing to make changes to the AWS environment you don't want, or at least are not expecting.
Once you are happy with the changes that terraform will apply to the AWS environment, you can run the command:
terraform apply
See what terraform apply does.
Keep an eye on the “ECS Instances” tab in the cluster’s UI. You should see the “Running tasks” on the draining instance(s) go down, and go up on the new instances.
Once all the tasks have moved, the old instance(s) will terminate and Terraform will complete. Check a few URLs on staging to make sure that everything’s up-and-running.
Now that terraform's apply is finished, you can repeat the process (steps 2-9) for the production cluster.
Finally, you should merge the changes in your new (local) branch into the local production branch, and then push your local production branch to the origin on GitHub.
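That last step is the usual git sequence; the branch name here is a placeholder.

```
# Merge the update branch into production and push it up.
git checkout production
git merge my-ami-update-branch
git push origin production
```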
After the production instances are fully up, check that they have roughly equal “Running tasks” numbers. ECS should schedule duplicate tasks on separate machines so that they are split across AZs. If you see a service has both of its tasks on the same instance you can run a force deployment to restart it. (See Restarting an ECS service)