Nginx rate limit by user agent (control bots)

Search engine indexing bots are getting smarter and therefore more aggressive; sometimes they become really annoying or can even affect the performance of the system.

While Nginx is a very powerful and flexible system, it is not always clear how to put all the configuration together to get the job done. It gets even harder when a single Nginx server serves multiple virtual hosts and you want to apply the same policy to all of them from within the http section of the configuration, instead of the server section of each site.

For rate limiting, Nginx uses the ngx_http_limit_req_module module, and it is pretty straightforward to limit based on the IP address or any other simple value. For more advanced configurations, however, you need to use maps with either static keys or regular expressions. That's where things get more confusing, especially if you want to have some defaults and whitelisting (exclusion from limiting) based on certain conditions.
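
For reference, the straightforward per-IP case looks roughly like this; the zone name perip, the rate and the burst values are just placeholders for the sketch (limit_req_zone has to live in the http section, while limit_req may go into http, server or location):

limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;
limit_req zone=perip burst=15 nodelay;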

With regard to exclusions, the Nginx documentation for rate limiting states:

The key can contain text, variables, and their combination. Requests with an empty key value are not accounted.

Sounds easy, but finding the correct way to implement such an exclusion took me quite some time of googling, reading, trying, failing, googling again and so on and so forth. So, just to have a correct solution documented somewhere closer to me, I will cover it here.

For clarity, and to make the solution a bit more complete, let’s assume we want to limit user agents that match the (GoogleBot|bingbot|YandexBot|mj12bot) pattern to 1 request per IP per minute, burstable to 2, and the rest of the world to 10 requests per IP per second, burstable to 15. To do this, the http section of nginx.conf has to contain the following part:

# the first map flags requests whose User-Agent matches the bot pattern (case-insensitive)
map $http_user_agent $isbot_ua {
        default 0;
        ~*(GoogleBot|bingbot|YandexBot|mj12bot) 1;
}
# the second map gives non-bots an empty key (excluded from limiting) and bots their IP address
map $isbot_ua $limit_bot {
        0       "";
        1       $binary_remote_addr;
}

# bots: 1 request per minute per IP; everyone else: 10 requests per second per IP
limit_req_zone $limit_bot zone=bots:10m rate=1r/m;
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;

limit_req zone=bots burst=2 nodelay;
limit_req zone=one burst=15 nodelay;

The trick here is that we need to chain two maps. The first one sets $isbot_ua to 1 when $http_user_agent matches the bot pattern and to 0 otherwise. The second one takes $isbot_ua as its key and sets $limit_bot to an empty value for everyone who got 0 in the first map, but to $binary_remote_addr for those who got 1. Since nginx does not account requests whose key is empty, every non-bot request is effectively whitelisted from the bots zone, while bot requests are limited per IP address.
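
As a side note, the map documentation also allows a variable as the resulting value, so the same effect should be achievable with a single map; I stuck with the two-map version above, but a more compact sketch would look like this:

map $http_user_agent $limit_bot {
        default "";
        ~*(GoogleBot|bingbot|YandexBot|mj12bot) $binary_remote_addr;
}

limit_req_zone $limit_bot zone=bots:10m rate=1r/m;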

The rest of the configuration parameters are easy enough to understand, and I won't cover them here, since you can always refer to the nginx documentation.

To make things even nicer, we can also tell nginx to return a proper HTTP 429 code (Too Many Requests) when someone goes above the limit, and hope that the requester interprets it accordingly and slows down. To do this, just add the following line to the same part of the nginx configuration:

limit_req_status 429;

If you are using the limit_conn directive anywhere in your nginx configuration, you can do the same for it as well:

limit_conn_status 429;
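
For completeness, a minimal limit_conn setup would look something like the following; the zone name addr and the limit of 10 concurrent connections per IP are arbitrary values for the sketch:

limit_conn_zone $binary_remote_addr zone=addr:10m;

limit_conn addr 10;
limit_conn_status 429;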

Hopefully the above will save me, and maybe even someone else, some time. More similar posts should follow as I get my hands on different things.

P.S.: Do you know the difference between “~” and “~*” in the nginx configuration when dealing with regular expressions? The answer is pretty simple: the former matches case-sensitively, while the latter matches case-insensitively ;-)
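
A tiny illustration with a throwaway map (the variable name and patterns are just for demonstration):

map $http_user_agent $case_demo {
        default 0;
        ~GoogleBot   1;    # case-sensitive: matches "GoogleBot" but not "googlebot"
        ~*bingbot    1;    # case-insensitive: matches "bingbot", "BingBot", "BINGBOT", ...
}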