server-group instance replacement does not transfer ENI #720

adamlundrigan · 2023-05-27T13:37:18Z


We use the server-group module of terraform-aws-asg to run HAProxy instances for outbound connection proxying. Our EIPs are attached to ENIs and those ENIs are attached to the server by the attach-eni command (from terraform-aws-server) in user-data.

This works great when we use Terraform to force a rolling deploy; the old instance is removed, the ENIs detached, a new instance started, and the ENIs reattached.

However, when the ASG decides to replace an instance, it does so in the opposite order - it spins up a new instance then removes the old one. This means that the new instance can't attach the ENIs during first boot. Insufficient error checking meant the ASG didn't know the new instance was "incomplete", so we ended up with a proxy running which could not forward any traffic.
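
For reference, a minimal sketch of the kind of check that was missing. It is not part of our current script. It assumes the instance role is allowed to call `autoscaling:SetInstanceHealth`, and the function names plus the `AWS_REGION` and `INSTANCE_ID` variables are hypothetical and would need to be set earlier in user-data.

```shell
# Sketch only: fail loudly when the expected ENIs are not attached.
# AWS_REGION, INSTANCE_ID and MY_INSTANCE_NAME are assumed to be set by the caller.

# Print the number of ENIs tagged for this server that are attached to this instance.
count_attached_enis() {
  aws ec2 describe-network-interfaces \
    --region "$AWS_REGION" \
    --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
              "Name=tag:ServerGroupInstanceName,Values=$MY_INSTANCE_NAME" \
    --query 'length(NetworkInterfaces)' \
    --output text
}

# Mark the instance Unhealthy and return non-zero if fewer than $1 ENIs are attached.
verify_enis_or_fail() {
  expected=$1
  attached=$(count_attached_enis)
  if [ "${attached:-0}" -lt "$expected" ]; then
    echo "ERROR: only ${attached:-0} of $expected ENIs attached" >&2
    aws autoscaling set-instance-health --region "$AWS_REGION" \
      --instance-id "$INSTANCE_ID" --health-status Unhealthy
    return 1
  fi
}
```

This would be called as `verify_enis_or_fail 2` after the attach loop. On its own it only turns a silent failure into another replacement, so it would need to be paired with a fix for the attach ordering.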

Troubleshooting notes from our internal ticket:

10.30.1.44 and 10.30.1.60 are on the same proxy server (halonmta-proxy-0 in eu-west-1)
SSHed into that instance to check on HAProxy (sudo systemctl status haproxy) - it was running
Checked network configuration (ip addr show) and noted that ens6 and ens7 were not present
Confirmed via AWS console that halonmta-proxy-svr0-* ENIs were no longer attached to the instance
Checked the history on the ASG which manages the instances; logs show it was replaced due to failed health check

Manually re-attached the ENIs to the instance using the AWS Console, then ran sudo netplan apply from the instance command line; traffic began flowing again
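
We did this through the Console, but a rough CLI equivalent of the re-attach step might look like the sketch below. The `reattach_eni` helper and the `AWS_REGION` variable are hypothetical; the ENI ID, instance ID and device index have to be supplied by whoever runs it.

```shell
# Sketch only: CLI equivalent of re-attaching one ENI by hand.
# AWS_REGION is assumed to be set by the caller.
reattach_eni() {
  eni_id=$1
  instance_id=$2
  device_index=$3
  aws ec2 attach-network-interface --region "$AWS_REGION" \
    --network-interface-id "$eni_id" \
    --instance-id "$instance_id" \
    --device-index "$device_index"
}
```

Usage would be `reattach_eni <eni-id> <instance-id> <device-index>` once per ENI, followed by `sudo netplan apply` on the instance as described above.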

The relevant section of the user-data.sh script for the ASG looks like this:

  # Attach every available ENI tagged for this server at its tagged interface index.
  while read -r NIC_ID NIC_INDEX
  do
    echo "ATTACHING: $NIC_ID"
    # Device index 0 is the primary interface, so the tagged index is shifted up by one.
    attach-eni --eni-id "$NIC_ID" --device-index "$((NIC_INDEX + 1))"
    sudo rm -f /etc/netplan/51-ens*.yaml
  done < <(aws ec2 describe-network-interfaces --region "${aws_region}" --filters "Name=status,Values=available" "Name=tag:ServerGroupInstanceName,Values=$${MY_INSTANCE_NAME}" | jq -r '.NetworkInterfaces[] | [.NetworkInterfaceId, (.TagSet|from_entries|.ServerGroupInterfaceIndex)] | @tsv')

This finds all the available ENIs tagged for the server and attaches them in order.
In the replacement case the ENIs are still attached to the failing server, so their status is in-use rather than available and the filter skips them.

I've looked through the ASG documentation and don't see a way to make it behave like the rolling-deploy script - terminate first then spin a new instance.

Any suggestions on how to prevent this issue from recurring?

I guess the user-data script could use the AWS CLI to force-detach the ENIs?
https://docs.aws.amazon.com/cli/latest/reference/ec2/detach-network-interface.html
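
A minimal sketch of what that force-detach step could look like, run before the attach loop. This is untested against our setup. The function names and the `AWS_REGION` variable are hypothetical, and it assumes any in-use ENI carrying this server's `ServerGroupInstanceName` tag belongs to the instance being replaced.

```shell
# Sketch only: release ENIs still held by the instance being replaced.
# AWS_REGION and MY_INSTANCE_NAME are assumed to be set by the caller.

# Print "<eni-id> <attachment-id>" for every in-use ENI tagged for this server.
list_in_use_enis() {
  aws ec2 describe-network-interfaces \
    --region "$AWS_REGION" \
    --filters "Name=status,Values=in-use" \
              "Name=tag:ServerGroupInstanceName,Values=$MY_INSTANCE_NAME" \
    --query 'NetworkInterfaces[].[NetworkInterfaceId, Attachment.AttachmentId]' \
    --output text
}

# Force-detach each ENI, then wait until it reports "available".
release_enis() {
  list_in_use_enis | while read -r eni_id attachment_id; do
    [ -n "$eni_id" ] || continue
    echo "DETACHING: $eni_id ($attachment_id)"
    # </dev/null keeps these commands from consuming the loop's input.
    aws ec2 detach-network-interface --region "$AWS_REGION" \
      --attachment-id "$attachment_id" --force </dev/null
    aws ec2 wait network-interface-available --region "$AWS_REGION" \
      --network-interface-ids "$eni_id" </dev/null
  done
}
```

Calling `release_enis` ahead of the existing loop should leave the ENIs in the available state that the loop filters on. AWS describes force detach as a last resort, so I'm not sure this is the right approach.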

Tracked in ticket #110206
