Updating the ECS cluster AMI
How to update the AMI on our ECS cluster instances
The Digital webapps cluster uses the Elastic Container Service on AWS. We have a handful of EC2 instances that actually host the containers.
These instances use a stock Amazon Machine Image (AMI) from Amazon designed for Docker that comes with the ECS agent pre-installed. From time to time, Amazon releases a new version of this “ECS-optimized” image, either to upgrade the ECS agent or the underlying OS.
Thanks to our instance-drain Lambda function, updating the cluster EC2 images is a zero-downtime process. Nevertheless, it’s best to run this during the weekly digital maintenance window, and make sure that staging looks good before doing it on production.
This process is sometimes referred to as “rolling” the cluster, though it’s more accurate to say that we bring up a second set of machines and migrate the containers to them.
1. Find the latest ID for the ECS-optimized AMI. You can do this on the Amazon ECS-optimized AMIs page.
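As an aside, AWS also publishes the latest ECS-optimized AMI ID as a public SSM parameter, so it can be looked up programmatically. A sketch of a Terraform data source that would do this (the parameter path assumes the Amazon Linux 2 x86_64 variant of the image; our `clusters.tf` currently takes a hard-coded ID instead):

```hcl
# Sketch: resolve the latest ECS-optimized AMI ID from AWS's public SSM parameter.
# The path below assumes the Amazon Linux 2 (x86_64) variant of the image.
data "aws_ssm_parameter" "ecs_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}

# The ID would then be available as data.aws_ssm_parameter.ecs_ami.value
```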
2. In a browser, navigate to the CityOfBoston/digital-terraform repository and edit the `apps/clusters.tf` file. Update the `instance_image_id` value for the `staging_cluster` module to the new AMI ID from step 1. Save/commit the file as a new branch, not directly to the `production` branch.
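The edit itself is a single value change. As a purely illustrative sketch (the module’s other arguments and source path here are hypothetical — the only field this runbook changes is `instance_image_id`):

```hcl
# Illustrative only: the real staging_cluster module in apps/clusters.tf
# has more arguments; the runbook change is just the instance_image_id line.
module "staging_cluster" {
  source            = "../modules/cluster"      # hypothetical module path
  instance_image_id = "ami-0123456789abcdef0"   # <- the new ECS-optimized AMI ID
}
```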
3. Make a PR which merges the new branch into the `production` branch, and assign a person to review the changes. When you make the PR, GitHub will automatically run an `atlantis plan` process (see what atlantis is).
4. When the plan is done, inspect the output and expect to see changes to:
   - resource "aws_autoscaling_group" "instances"
   - resource "aws_cloudwatch_metric_alarm" "low_instance_alarm"
   - resource "aws_launch_configuration" "instances" (this last one will have the new AMI ID)
   Any other changes the plan identifies should be carefully investigated: Terraform may be proposing changes to the AWS environment that you don’t want, or at least are not expecting.
5. After viewing the plan, if you need to update the Terraform scripts, be sure to save the changes to the new branch. If committing your changes does not trigger the `atlantis plan` automatically, you can run it manually by creating a new comment with `atlantis plan`.
6. Once the `atlantis plan` is finished and the PR has been approved, create a new comment `atlantis apply`. This will cause Atlantis to apply the changes to AWS (Atlantis runs a `terraform apply` command in a background process). See what happens.
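If you prefer a terminal to the console while the apply runs, the same per-instance “Running tasks” numbers can be pulled from the ECS API. A minimal sketch using boto3 (the client is passed in so the function can be exercised without AWS credentials; the cluster name in the usage comment is a placeholder, not our real cluster name):

```python
def instance_task_counts(ecs, cluster):
    """Return (container-instance ARN, status, running-task count) tuples.

    `ecs` is a boto3 ECS client, or any object with the same two methods,
    which keeps the function testable without real AWS credentials.
    """
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]
    return [
        (ci["containerInstanceArn"], ci["status"], ci["runningTasksCount"])
        for ci in described
    ]

# Usage (assumes boto3 and AWS credentials; "apps-staging" is a placeholder):
#   import boto3
#   for arn, status, running in instance_task_counts(boto3.client("ecs"), "apps-staging"):
#       print(arn.split("/")[-1], status, running)
```

During the roll you would expect the DRAINING instances’ counts to fall to zero while the new ACTIVE instances’ counts rise.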
Keep an eye on the “ECS Instances” tab in the cluster’s UI. You should see the “Running tasks” on the draining instance(s) go down, and go up on the new instances.
Once all the tasks have moved, the old instance(s) will terminate and Terraform will complete. Check a few URLs on staging to make sure that everything’s up-and-running.
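“Check a few URLs” can be scripted too. A small sketch (the fetcher is injectable for testing; the staging URL in the comment is a guess — substitute the real ones):

```python
from urllib.request import urlopen

def smoke_check(urls, fetch=None):
    """Return {url: bool} indicating whether each URL answered with HTTP 200."""
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=10).status
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url) == 200
        except Exception:
            results[url] = False  # connection errors count as failures
    return results

# Usage (hypothetical staging URL):
#   smoke_check(["https://www-staging.boston.gov/"])
```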
Once Atlantis’s apply has finished, you can merge the staging PR and repeat the process (steps 2-6) for the production cluster.
If you have Terraform installed on your local computer, you can do the update directly from there.
1. Find the latest ID for the ECS-optimized AMI. You can do this on the Amazon ECS-optimized AMIs page.
2. Ensure your cloned copy of the digital-terraform repository is on the `production` branch, and that the branch is up to date with the origin on GitHub.
3. Create a new branch from the `production` branch.
4. In your preferred IDE, open the `/apps/clusters.tf` file and update the `instance_image_id` value for the `staging_cluster` module to the new AMI ID from step 1. Save/commit the file to the new branch (not directly to the `production` branch).
5. In a terminal/shell, from the repo’s `apps/` folder, run the command:
   terraform plan
6. When the plan is done, inspect the output and expect to see changes to:
   - resource "aws_autoscaling_group" "instances"
   - resource "aws_cloudwatch_metric_alarm" "low_instance_alarm"
   - resource "aws_launch_configuration" "instances" (this last one will have the new AMI ID)
   Any other changes the plan identifies should be carefully investigated: Terraform may be proposing changes to the AWS environment that you don’t want, or at least are not expecting.
7. Once you are happy with the changes that Terraform will apply to the AWS environment, run the command:
   terraform apply
   See what `terraform apply` does.
8. Keep an eye on the “ECS Instances” tab in the cluster’s UI. You should see the “Running tasks” on the draining instance(s) go down, and go up on the new instances.
9. Once all the tasks have moved, the old instance(s) will terminate and Terraform will complete. Check a few URLs on staging to make sure that everything’s up-and-running.
Once Terraform’s apply has finished, you can repeat the process (steps 2-9) for the production cluster.
Finally, you should merge the changes in your new (local) branch into the local `production` branch, and then push your local `production` branch to the origin on GitHub.
After the production instances are fully up, check that they have roughly equal “Running tasks” numbers. ECS should schedule duplicate tasks on separate machines so that they are split across AZs. If you see a service has both of its tasks on the same instance you can run a force deployment to restart it. (See Restarting an ECS service)
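Eyeballing task placement across instances is tedious; the grouping logic is simple enough to script. A sketch that flags services with two or more tasks on the same instance, given (service, instance) pairs gathered however you like from the ECS API:

```python
from collections import defaultdict

def colocated_services(task_placements):
    """Given (service_name, container_instance_id) pairs — one per task —
    return the set of services with two or more tasks on a single instance."""
    counts = defaultdict(int)
    for service, instance in task_placements:
        counts[(service, instance)] += 1
    return {service for (service, _), n in counts.items() if n >= 2}

# A service flagged here is a candidate for a force deployment
# (see "Restarting an ECS service").
```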