AWS Beanstalk and more

While playing around with AWS Elastic Beanstalk (EB) and related infrastructure, I found out a couple of things that I want to note for future reference.

Initial scenario: an Elastic Beanstalk environment in a private subnet, with an AWS load balancer (LB) serving requests and doing SSL termination. There is no need for end-to-end SSL, as the actual EC2 instances are in a private subnet. I also need to redirect HTTP to HTTPS, preferably at the load balancer level.

 

So first we need to create a VPC, an Internet Gateway and some subnets. I have one public subnet (local routes + default via Internet GW) for service things like NAT gateways, bastion hosts, etc., two private subnets (local traffic only) for web instances (in different AZs) and two public subnets for load balancers.

Here come the first few notes:

  • to have the EC2 instances of web servers in a private subnet, you need to have the LB in a public subnet.
  • private subnets with EC2 instances should have a routing table with a default route to a NAT Gateway, which MUST be in a public subnet (either along with the LB or in a separate service subnet)
  • the VPC MUST have DNS resolution and DNS hostnames set to YES
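The DNS requirement from the last note can also be set from the command line. This is only a minimal sketch, assuming the AWS CLI is installed and configured. The VPC ID is a placeholder, and the run helper (which just prints the commands unless DRY_RUN=0 is set) is my own addition for safety:

```shell
# Placeholder VPC ID; replace it with your own.
VPC_ID="${VPC_ID:-vpc-0123456789abcdef0}"

# Print the command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

enable_vpc_dns() {
  # The two attributes cannot be changed in the same request,
  # so each one gets its own call.
  run aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-support '{"Value":true}'
  run aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-hostnames '{"Value":true}'
}

enable_vpc_dns "$VPC_ID"
```

Run it once as is to review the printed commands, then again with DRY_RUN=0 to apply them.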

The above is more or less enough to start with EB, as it will create a bunch of stuff for you automatically (or you can find a way to mess around with plenty of things).

For EB:

  • Make sure the EB environment is linked to the appropriate VPC before you start checking other settings, as this is what brings in all those subnets, security groups and other options.
  • Use a multi-instance setup with an LB even if you plan to have only a single instance. Just set the auto scaling max to 1; this will give you a bunch of options and flexibility later on.
  • Use an Application LB instead of a Classic one
  • Use your proper key pair for EC2 instances in the Security section, as it will give you a chance to SSH to the instances to troubleshoot in case of problems (via a bastion host or by temporarily making the web subnet public and attaching an Elastic IP to the instance)
  • Modify the webroot in the software configuration in case your project is not served directly from the root of the project (/public, /webroot, etc.)
  • Utilize environment variables for passing info about DB settings, DEBUG, etc., if your application supports it. Very handy.
  • Use AWS Certificate Manager + Route53 for issuing and renewing SSL certificates that you can attach to the LB.
  • Make sure you have both listeners in the LB setup: for HTTP and HTTPS

When your environment is up and running, there are a couple of things to adjust:

  • CNAME your environment domain to the EB entrypoint domain
  • In EC2, modify the rules of the EB listener that works on port 80: add a rule on top of the default one to redirect to HTTPS (same host, path and query) when the path is *
  • In case your app uses full URLs, you may have a problem that it sends links in HTTP. In my case I am passing a BASE_URL env var to the environment of the EC2 instances, and my app picks it up from there and returns correct links and refs to other resources like JS & CSS.
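The redirect rule from the list above can also be created with the AWS CLI instead of clicking through the EC2 console. A rough sketch under a few assumptions: the listener ARN is a placeholder, the 301 status code and priority 1 are my choices (the console lets you pick either), and the run helper only prints the command unless DRY_RUN=0 is set:

```shell
# Placeholder ARN of the port 80 listener; replace it with your own.
LISTENER_ARN="${LISTENER_ARN:-arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/NAME/ID/ID}"

# Print the command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

add_https_redirect() {
  # Host, path and query are left out of RedirectConfig on purpose:
  # when omitted, the load balancer keeps the values of the original request.
  run aws elbv2 create-rule \
    --listener-arn "$1" \
    --priority 1 \
    --conditions '[{"Field":"path-pattern","Values":["*"]}]' \
    --actions '[{"Type":"redirect","RedirectConfig":{"Protocol":"HTTPS","Port":"443","StatusCode":"HTTP_301"}}]'
}

add_https_redirect "$LISTENER_ARN"
```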

This is just a short list of things to keep in mind. More things might come, and most of this should go into AWS CloudFormation, but I had to play around manually in the AWS Management Console first to get a feel for the service.

mount and systemd

Had a task: double the size of a volume on an AWS EC2 instance. The process is still manual and is roughly as follows:

  • Create a new volume on AWS with double the size of the old one
  • Attach it to the instance
  • Create partition and filesystem on the new volume
  • Mount the new volume somewhere next to the old volume mount point
  • Rsync data from the old volume to the new volume
  • Adjust /etc/fstab to point to the new volume for the corresponding mount point
  • Unmount both volumes
  • Mount the new volume to the old mount point
  • Detach the old volume from instance
  • Delete the old volume

All pretty simple and straightforward. BUT! The new volume does not mount to the old mount point, and the mount command is silent about it!!!

Syslog gives a hint:

systemd: Unit var-whatever.mount is bound to inactive unit dev-xvdg1.device. Stopping, too.
lb1 systemd: Unmounting /var/whatever...
lb1 systemd: Unmounted /var/whatever.

That’s interesting. Why is it still bound to an inactive device (which I have already detached), and how can I unbind it?

Apparently all records in /etc/fstab are converted to systemd units, and all mounting is (these ugly days) done via systemd. So when I changed /etc/fstab, systemd didn’t update the related unit and was still trying to mount the old device. To fix the problem you need to run:

systemctl daemon-reload
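Put together, the copy and remount part of the procedure looks roughly like this, with the reload in the right place. It is only a sketch: /var/whatever comes from the log above, while the new device name and the temporary mount point are hypothetical, and the run helper just prints the commands unless DRY_RUN=0 is set:

```shell
OLD_MNT="/var/whatever"        # mount point from the syslog lines above
NEW_DEV="/dev/xvdh1"           # hypothetical name of the new partition
TMP_MNT="/mnt/whatever-new"    # hypothetical temporary mount point

# Print the command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

migrate_mount() {
  run mount "$NEW_DEV" "$TMP_MNT"
  # -a keeps permissions and times, -H hard links, -A ACLs, -X xattrs
  run rsync -aHAX "$OLD_MNT/" "$TMP_MNT/"
  # ... edit /etc/fstab by hand here so that $OLD_MNT points to $NEW_DEV ...
  run umount "$TMP_MNT"
  run umount "$OLD_MNT"
  # Regenerate the mount units from the changed /etc/fstab
  run systemctl daemon-reload
  # With only the mount point given, mount takes the device from /etc/fstab
  run mount "$OLD_MNT"
}

migrate_mount
```

The unit name from the log can be derived from the path with systemd-escape -p --suffix=mount /var/whatever, which is handy for checking the unit with systemctl status afterwards.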

I am too old for this shit… Why are simple things getting more and more complicated (firewalld? ;-))

AWS net.ipv4.tcp_tw_recycle follow-up

Yesterday I wrote a post about an AWS EC2 instance networking problem that I was pretty surprised to find. And while yesterday I was focusing on fixing the problem, today my first task was to find out what actually sets the flag. A quick grep on /etc of the instance revealed that the settings were applied by /etc/sysctl.d/net.ipv4.tcp_tw_recycle.
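For the record, the grep was along these lines. A small sketch rather than the exact command I typed: the function name is made up, and besides /etc it also looks into the other standard sysctl.d locations:

```shell
# Print every file and line that mentions the given sysctl key.
# Usage: find_sysctl_setting KEY PATH...
find_sysctl_setting() {
  key="$1"
  shift
  # -F: treat the key as a fixed string, so the dots are not regex wildcards
  grep -rHnF -- "$key" "$@" 2>/dev/null
}

find_sysctl_setting net.ipv4.tcp_tw_recycle \
  /etc/sysctl.conf /etc/sysctl.d /usr/lib/sysctl.d /run/sysctl.d \
  || echo "net.ipv4.tcp_tw_recycle is not set in any sysctl file"
```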

Very strange to find it there along with net.ipv4.tcp_tw_reuse, which is also something that you should not touch. Anyhow, the problem is identified, fixed and about to be added to monitoring…

Amazon AWS, WTF?

Spent the whole day troubleshooting some pretty random but persistent TCP connection timeouts to one of the Amazon AWS EC2 instances. The problem was that some PCs/laptops/servers would face long-term connection timeouts to the instance, while others were working fine. The ones with timeouts would experience problems only at the TCP level, while ICMP ping would pass normally. The other strange thing is that rebooting a client into a different kernel would fix the problem for that particular client for a while.

After checking and googling with no luck and getting completely pissed off, I gave the problem another thought, and this time I felt that something was wrong with AWS NATting. That clearly brought back memories of troubleshooting TCP fine tuning. So I checked the article, found the values to verify, and went to check the actual instance. A quick look into /proc/sys/net/ipv4/tcp_tw_recycle revealed the problem: its value was 1. Writing 0 back to it, which applies immediately, fixed the connectivity issues. But then, when I looked into /etc/sysctl.conf, I saw that the value there was already 0!!! How is that possible if we didn’t change it manually via proc, haven’t touched sysctl.conf for ages, and the last server reboot was done by Amazon only a few days ago due to their planned maintenance?
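A check like the one below would have saved me the day. It is just a sketch of a monitoring probe: the function name and the three status words are made up, and the path can be overridden for testing:

```shell
PROC_FILE="${PROC_FILE:-/proc/sys/net/ipv4/tcp_tw_recycle}"

# Report the state of the flag: "disabled", "enabled" or "absent".
check_tw_recycle() {
  file="$1"
  if [ ! -r "$file" ]; then
    # Kernels from 4.12 on no longer have this knob at all
    echo "absent"
    return 0
  fi
  if [ "$(cat "$file")" = "0" ]; then
    echo "disabled"
  else
    echo "enabled"
  fi
}

check_tw_recycle "$PROC_FILE"

# To switch it off right away (as root):
#   echo 0 > /proc/sys/net/ipv4/tcp_tw_recycle
```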

 

Amazon AWS Subnet Custom Gateway

While Amazon provides different ways to route traffic within and out of your subnets by means of internet gateways and NAT gateways, they will not always suit your needs. If you want full control with lots of possibilities for customisation, you might consider building your own firewall instance and pushing all traffic through it.

Amazon provides NAT instances, but they also have some limitations. To get the full feature set, it is possible to build a custom EC2 instance with whatever AMI and settings you like, attach two network interfaces to it, one in a private and one in a public subnet, and do classic iptables NAT on it.

For all instances in the private subnet to be routed via your custom firewall, you need to adjust the routing table for that subnet and point the default route to the network interface of the firewall instance that is in that same private subnet.
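In commands, the setup described above comes down to something like the sketch below. Everything named here is an assumption to adapt: the interface name, the ENI and route table IDs are placeholders, and the private subnet CIDR is the one from the routing tables shown later in this post. Disabling the source/destination check is not mentioned above, but an instance cannot forward other hosts' traffic on AWS without it. The run helper only prints the commands unless DRY_RUN=0 is set:

```shell
PRIVATE_CIDR="${PRIVATE_CIDR:-10.10.20.0/24}"   # private subnet behind the firewall
PUBLIC_IF="${PUBLIC_IF:-eth0}"                  # firewall interface in the public subnet (assumed)
ENI_ID="${ENI_ID:-eni-0123456789abcdef0}"       # placeholder: firewall interface in the private subnet
RTB_ID="${RTB_ID:-rtb-0123456789abcdef0}"       # placeholder: route table of the private subnet

# Print the command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# On the firewall instance: forward packets and NAT the private subnet.
setup_nat() {
  run sysctl -w net.ipv4.ip_forward=1
  run iptables -t nat -A POSTROUTING -s "$PRIVATE_CIDR" -o "$PUBLIC_IF" -j MASQUERADE
}

# From any machine with the AWS CLI: point the subnet at the firewall.
route_subnet_via_firewall() {
  run aws ec2 modify-network-interface-attribute --network-interface-id "$ENI_ID" --no-source-dest-check
  run aws ec2 create-route --route-table-id "$RTB_ID" --destination-cidr-block 0.0.0.0/0 --network-interface-id "$ENI_ID"
}

setup_nat
route_subnet_via_firewall
```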

All of the above works pretty well, but looks a bit weird. Let's assume the following:

  • We have a VPC 10.10.0.0/16
  • We have a private subnet 10.10.20.0/24 within our VPC
  • We have a public subnet 10.10.10.0/24 within our VPC
  • We have a firewall instance with:
    • public subnet IP: 10.10.10.10
    • private subnet IP: 10.10.20.20
  • We have a host in the private subnet with IP 10.10.20.50

Now, the first question is: why don’t we use 10.10.20.1 on the firewall instance in the private subnet? Easy: it is used by the Amazon gateway, and even though we have created a routing table to route everything through 10.10.20.20, on an actual host in the private subnet the routing table will be:

default via 10.10.20.1 dev eth0 
10.10.20.0/24 dev eth0 proto kernel scope link src 10.10.20.50 

This means that the host will send traffic to the AWS gateway, and that one will pass it over to our firewall. Probably the idea behind such a configuration is that AWS still needs to check security groups and so on before it hands the traffic over to us.

Cool, and this works fine, but there is a small issue: if you try to ping or access any services at the firewall's public IP (10.10.10.10) from the host in your private subnet, you will fail! Moreover, if you fire up tcpdump on the firewall server, listening for any packets from the host in the private subnet on the private subnet interface (10.10.20.20), and try to ping 10.10.10.10 from the private host, you will see absolutely nothing related to this. Nor will you see any other activity from your private host towards the public IP address of your firewall.

Wanna go even more weird? If you try to access any other host in the public subnet from your private host (for the sake of example, assume you have another host with IP 10.10.10.99 and you try to ping it from 10.10.20.50), this will work as expected and traffic will flow via the firewall as configured.

Not sure why and how, but Amazon seems to block access at 10.10.20.1 from any host in the 10.10.20.0/24 network to 10.10.10.10, because that IP belongs to the firewall that is the default gateway for all hosts on 10.10.20.0/24 (even though the IP is in another subnet).

The solution for this problem (if that’s a problem in your case) is either to add a direct route to 10.10.10.10 via 10.10.20.20 on the private host, so that the private host avoids using Amazon's 10.10.20.1 for this route:

default via 10.10.20.1 dev eth0 
10.10.10.10/32 via 10.10.20.20 dev eth0
10.10.20.0/24 dev eth0 proto kernel scope link src 10.10.20.50

or even completely ignore 10.10.20.1 and set default gateway via 10.10.20.20:

default via 10.10.20.20 dev eth0 
10.10.20.0/24 dev eth0 proto kernel scope link src 10.10.20.50

Either way, you won’t be able to do it via AWS routing tables; you will have to configure routing right on the private host, via the ip route tool or, for persistence, a route-<interface> file (e.g. route-eth0).
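With the ip route tool, the two variants above look roughly like this. A sketch using the example addresses from this post; the function names are made up and eth0 is taken from the routing tables shown. The run helper only prints the commands unless DRY_RUN=0 is set:

```shell
FIREWALL_PUBLIC_IP="10.10.10.10"
FIREWALL_PRIVATE_IP="10.10.20.20"
DEV="eth0"

# Print the command by default; set DRY_RUN=0 to really execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Variant 1: only the firewall's public IP goes directly via the firewall.
add_firewall_host_route() {
  run ip route add "$FIREWALL_PUBLIC_IP/32" via "$FIREWALL_PRIVATE_IP" dev "$DEV"
}

# Variant 2: bypass 10.10.20.1 completely.
use_firewall_as_default_gw() {
  run ip route replace default via "$FIREWALL_PRIVATE_IP" dev "$DEV"
}

add_firewall_host_route
```

For persistence, the same "10.10.10.10/32 via 10.10.20.20 dev eth0" line goes into the route file of the interface.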

Keep in mind that if you divert all traffic via 10.10.20.20, you might lose some security group checks within the subnet, so make sure to implement whatever security you need on the actual firewall.