Spend the whole day troubleshooting a problem of some pretty random, but stable tcp connection timeouts to one of the Amazon AWS EC2 instance. The problem was that some PCs/laptops/servers would face long term connection timeout to the instance, while others were working fine. The ones with timeouts would experience problems only on TCP level, while ICMP ping would pass normally. The other strange thing is that rebooting client to different kernel would fix the problem for that particular client for a while.
After checking and googling with no luck and getting completely pissed off, I gave the problem another thought and this time I felt that something is wrong with AWS NATting. That clearly brought the memories of troubleshooting TCP fine tuning. So I checked the article, found out the values to make sure are present and went to check the actual instance. Quick look into /proc/sys/net/ipv4/tcp_tw_recycle revealed the problem with its value being 1, so changing it back to 0 with cat to apply immediately fixed the connectivity issues, but then, when I looked into /etc/sysctl.conf, I saw that the value there was already 0!!! How come is it possible if we didn’t change it manually via proc, nor have we touched sysctl.conf for ages and the last server reboot was only few days ago done by Amazon due to their planned maintenance?