Rate Limiter: Defend Your APIs against DDoS Attacks

Intent: A rate limiter controls the rate at which requests or actions are allowed to occur within a system or API. It sets limits on the number of requests or actions that can be made within a specific time frame, preventing excessive usage, protecting system resources, ensuring fair allocation, and mitigating the risk of performance degradation or abuse. By enforcing predefined thresholds on request rates, a rate limiter helps maintain system stability, scalability, and a consistent user experience.

Rate Limiter - High level

Problem

At times, your application might experience a rapid or sustained increase in the load. If the application isn’t designed to handle the increased load, the application or the resources that it uses might fail, making the application unavailable. The increased load might be caused by malicious requests, such as network-based distributed denial-of-service (DDoS) attacks. A sudden spike in the load can also occur due to other reasons such as configuration errors in the client software.

Why not auto-scale?

We could try to overcome this issue with auto-scaling. However, auto-scaling can be costly, and in this scenario the excess requests come from an attack rather than from real customers, so the newly provisioned instances would simply become additional victims of the DDoS. Auto-scaling alone is therefore not a viable solution.

To ensure that your application can handle excessive load, consider applying suitable rate-limiting mechanisms.

Rate-limiting techniques can also help optimize the cost of your cloud infrastructure. For example, by setting project-level quotas for specific resources, you can limit the billing that the project can incur for those resources.

When to Use a Rate Limiter?

To Prevent Denial Of Service

Rate limiting is effective in preventing Denial of Service (DoS) attacks. In a DoS attack, an attacker overwhelms an application by flooding it with requests, causing it to become unresponsive or crash. By limiting the rate of incoming requests, rate limiting helps protect the system from being overwhelmed and ensures the availability of resources. This is particularly important in Distributed DoS (DDoS) attacks, where multiple sources simultaneously bombard the system.

To Prevent Brute Force Attacks

Rate limiting is commonly used to combat brute force attacks by blocking excessive requests. In a brute force attack, hackers automate a continuous stream of requests in the hopes of finding a vulnerability. By implementing rate limiting, the system can slow down or block these attacks, and anomalous patterns of failed requests can trigger alerts for further investigation.

To Prevent Resource Starvation

Rate limiting plays a crucial role in preventing resource starvation, improving the availability of API-based services. Unintentional resource starvation, often referred to as friendly-fire DoS, can occur due to software errors or misconfigurations. By applying rate limiting, services can avoid exhausting critical resources, such as databases, ensuring that requests are processed within sustainable limits. This safeguards the overall system performance and prevents contention for resources.

To Prevent Cascading Failures

Rate limiting acts as a defensive measure for services, protecting them from excessive usage and reducing the risk of cascading failures. Even in highly scalable systems, setting limits on consumption is essential. By implementing rate limiting on both the client and server sides, throughput can be maximized, and end-to-end latency can be minimized across large distributed systems. This ensures that individual components do not become overloaded, leading to a system-wide failure.

Bulk Operations or Batch Processing

Rate limiters can be useful when performing bulk operations or batch processing tasks. They help regulate the rate at which these operations are executed, preventing system overload, and allowing for efficient and controlled processing.

Rate Limiter at Different Levels

User-Level Rate Limiting

In cases where a system can uniquely identify a user, user-level rate limiting can be implemented. This approach restricts the number of API requests a user can make within a specified time period. For instance, if a user is allowed only two requests per second, any third request made within the same second is denied. User-level rate limiting ensures fair usage among users, but it may introduce overhead in maintaining usage statistics for each user, potentially draining system resources.
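
Here is a minimal sketch of user-level limiting that keeps one limiter per user ID. It assumes the TokenBucketRateLimiter class shown later in this article and an illustrative limit of 2 requests per second; in practice the user ID would come from an API key or auth token, and idle entries would need to be evicted to bound the memory overhead mentioned above.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerUserRateLimiter {
    // One limiter per user: a 2-token bucket refilled at 2 tokens per second (illustrative values).
    private final Map<String, TokenBucketRateLimiter> limiters = new ConcurrentHashMap<>();

    public boolean allowRequest(String userId) {
        // Lazily create a limiter the first time a user is seen.
        TokenBucketRateLimiter limiter =
                limiters.computeIfAbsent(userId, id -> new TokenBucketRateLimiter(2, 2));
        return limiter.allowRequest();
    }
}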

Server-Level Rate Limiting

In distributed systems where API-based services are spread across multiple servers, server-level rate limiting plays a crucial role. It enables load sharing among servers to optimize resource utilization. By setting a limit on the number of requests a specific server can handle, server-level rate limiting ensures that no single server becomes overwhelmed. If a server exceeds its rate limit, requests are either dropped or rerouted to other available servers. This approach enhances system availability and prevents denial of service attacks targeting a particular server.

Geography-Based Rate Limiting

API-based services often have servers located in different regions worldwide. Geography-based rate limiting is implemented to control the number of service requests originating from specific geographic areas. This approach can also be time-based, meaning that rate limiting rules can be set for particular periods. For instance, if requests from a specific geographic location are typically low between 1:00 am and 6:00 am, a web server can enforce rate limiting during that timeframe. If an attack occurs during those hours and the number of requests spikes, the rate limiting mechanism triggers an alert, allowing the organization to promptly respond to the attack.
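
As a rough sketch, geography- and time-based rules could be combined as below. It assumes the FixedWindowRateLimiter class shown later in this article, a region string derived from something like a GeoIP lookup, and purely illustrative limits.

import java.time.LocalTime;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GeoTimeRateLimiter {
    private final Map<String, FixedWindowRateLimiter> limiters = new ConcurrentHashMap<>();

    public boolean allowRequest(String region, LocalTime now) {
        // Illustrative policy: tighten the limit between 01:00 and 06:00, when traffic is usually low.
        boolean offPeak = !now.isBefore(LocalTime.of(1, 0)) && now.isBefore(LocalTime.of(6, 0));
        int limit = offPeak ? 10 : 100;                    // hypothetical per-minute limits
        String key = region + (offPeak ? ":off-peak" : ":peak");

        FixedWindowRateLimiter limiter =
                limiters.computeIfAbsent(key, k -> new FixedWindowRateLimiter(limit, 60));
        return limiter.allowRequest();
    }
}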

Designing a Rate Limiter

Functional Requirements

  • Flexible Rate Limiting Rules: The API should support the definition and management of multiple rate-limiting rules. This enables the configuration of different rate limits based on various criteria such as endpoints, users, or specific actions.
  • Customizable Response Handling: The API should provide the capability to customize the response sent to clients when rate limits are exceeded. This allows for tailored error messages or specific headers to inform clients about the rate limit status and any necessary actions to be taken.
  • Storage and Retrieval of Rate-Limit Data: The API should include mechanisms for storing and retrieving rate-limit data. This ensures that relevant information such as request counts, timestamps, and associated metadata can be efficiently recorded and accessed for enforcement and monitoring purposes.
  • Proper Error Handling: The API should be implemented with appropriate error handling mechanisms. When the request threshold is exceeded for a single server or for a particular combination of criteria, the API should respond with informative and meaningful error messages. This helps clients understand the cause of the error and take appropriate actions to adhere to the rate limits.
  • Expose an endpoint: The rate limiter API should expose an endpoint for clients to check their current rate limit status. This endpoint can return the number of requests remaining within the time window, as well as the time at which the limit will be reset.
  • Set headers: When a request is allowed, the API should set appropriate headers in the response to indicate the remaining number of requests that the client can make within the time window, as well as the time at which the limit will be reset (see the sketch after this list).
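
As a minimal sketch, the response headers could be built as below. The X-RateLimit-* names follow a widely used convention rather than a formal standard, so treat the exact names and values as assumptions.

import java.util.Map;

public class RateLimitHeaders {
    // Returns the headers to attach to an allowed response.
    public static Map<String, String> build(int limit, int remaining, long resetEpochSeconds) {
        return Map.of(
                "X-RateLimit-Limit", String.valueOf(limit),               // maximum requests per window
                "X-RateLimit-Remaining", String.valueOf(remaining),       // requests left in the current window
                "X-RateLimit-Reset", String.valueOf(resetEpochSeconds));  // epoch second when the window resets
    }
}

The same values can back the rate limit status endpoint mentioned above.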

Non-Functional Requirements

  • The API should be highly available and scalable; availability is critical because the rate limiter sits in the path of every incoming request.
  • The API should be secure and protected against malicious attacks.
  • The API should be easy to integrate with existing systems.
  • The rate limiter should add as little latency as possible, since it sits in the request path and directly affects overall system performance.

Where to Put the Rate Limiter?

Client Side

  • Placing the rate limiter on the client-side involves incorporating rate limiting logic directly into the client application or client library.
  • This approach allows clients to proactively manage their own request rates and avoid exceeding predefined limits.
  • This reduces network traffic, gives clients immediate feedback, and preserves client autonomy.
  • However, the client is generally an unreliable place to enforce rate limiting, because client requests can easily be forged by malicious actors. Moreover, we might not have control over the client implementation.

Server Side

Rate limiter on the server side (image source: System Design by Alex Xu)
  • Placing the rate limiter on the server-side involves implementing rate limiting logic within the server infrastructure.
  • This allows the server to handle rate limiting centrally and enforce limits on incoming requests from various clients.
  • Server-side rate limiting is useful when there is a need to protect server resources, maintain fairness across clients, or when clients cannot be fully trusted to adhere to rate limits.
  • Transmitting rate-limited requests to the server increases network traffic and load. If many requests are rate-limited, clients may also experience higher latency before being served. Additionally, enforcing rate limits purely on the server side can be difficult to scale in highly distributed or high-traffic systems.

Middleware

Rate limiter in an API gateway (image source: System Design by Alex Xu)

Assume our API allows 2 requests per second, and a client sends 3 requests to the server within a second. The first two requests are routed to the API servers. However, the rate limiter middleware throttles the third request and returns HTTP status code 429, which indicates that the user has sent too many requests (a minimal filter sketch follows the list below).

Rate Limiter in middleware
  • Middleware rate limiters are placed in between the client and server components, acting as a separate layer responsible for enforcing rate limits.
  • This approach involves using dedicated middleware components or API gateways that intercept and process incoming requests before forwarding them to the server.
  • Cloud microservices have become widely popular, and rate limiting is usually implemented within a component called the API gateway. An API gateway is a fully managed service that provides centralized rate limiting along with separation of concerns and features such as SSL termination, authentication, IP whitelisting, and serving static content.
  • However, if the middleware rate limiter becomes a single point of failure, it can impact the overall system’s availability and reliability.
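
To make the middleware placement concrete, here is a minimal sketch of a servlet filter that throttles requests before they reach the API servers. It assumes the Servlet 4+ javax.servlet API, reuses the TokenBucketRateLimiter class defined later in this article, and uses an illustrative global limit of 2 requests per second; a production gateway would key limits per client and share counters across instances.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class RateLimitFilter implements Filter {
    // Global limiter: a 2-token bucket refilled at 2 tokens per second (illustrative values).
    private final TokenBucketRateLimiter limiter = new TokenBucketRateLimiter(2, 2);

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (limiter.allowRequest()) {
            chain.doFilter(request, response);            // forward to the API servers
        } else {
            HttpServletResponse httpResponse = (HttpServletResponse) response;
            httpResponse.setStatus(429);                  // Too Many Requests
            httpResponse.setHeader("Retry-After", "1");   // suggest retrying after one second
        }
    }
}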

Algorithms for Rate Limiter

Token Bucket

The token bucket algorithm is a popular method used in rate limiting to control the rate of requests or actions. It operates based on the concept of a token bucket, which is a metaphorical bucket holding tokens.

Token bucket rate limiter (image source: System Design by Alex Xu)

In the token bucket algorithm, a bucket is initially filled with a predetermined number of tokens. Each token represents permission to make a request or perform an action. When a request is made, a token is consumed from the bucket. If the bucket is empty, meaning there are no more tokens available, further requests are either blocked or delayed until more tokens are added to the bucket.

Tokens are added to the bucket at a fixed rate over time, known as the token generation rate or refill rate. This determines how quickly the bucket is replenished with tokens. If the request rate exceeds the refill rate, the bucket will eventually become empty, and requests beyond the available tokens will be limited.

The token bucket algorithm allows for burstiness in requests. If tokens are available in the bucket, multiple requests can be made at once, as long as there are enough tokens to accommodate them. However, if the burst of requests exceeds the token bucket’s capacity, the excess requests will be subject to rate limiting.

This algorithm offers flexibility in rate limiting by adjusting the token generation rate to control the overall rate of requests. It provides a predictable and adjustable mechanism for enforcing rate limits while allowing occasional bursts of traffic within the defined limits.

The token bucket algorithm is widely used in various systems and network components to regulate the flow of requests, prevent abuse, and ensure fair usage of resources.

import java.time.Instant;

public class TokenBucketRateLimiter {
    private int maxTokens;
    private int tokens;
    private long lastRefillTimestamp;
    private int refillRate;

    public TokenBucketRateLimiter(int maxTokens, int refillRate) {
        this.maxTokens = maxTokens;
        this.refillRate = refillRate;
        this.tokens = maxTokens;
        this.lastRefillTimestamp = Instant.now().getEpochSecond();
    }

    // Consume one token if available; otherwise reject the request.
    public synchronized boolean allowRequest() {
        refillTokens();

        if (tokens > 0) {
            tokens--;
            return true;
        }

        return false;
    }

    // Add tokens proportional to the elapsed time and refill rate, capped at maxTokens.
    private void refillTokens() {
        long now = Instant.now().getEpochSecond();
        long elapsedTime = now - lastRefillTimestamp;
        int newTokens = (int) (elapsedTime * refillRate);

        if (newTokens > 0) {
            tokens = Math.min(tokens + newTokens, maxTokens);
            lastRefillTimestamp = now;
        }
    }

    public static void main(String[] args) {
        // Example usage
        TokenBucketRateLimiter rateLimiter = new TokenBucketRateLimiter(10, 2);

        // Simulating requests
        for (int i = 1; i <= 15; i++) {
            boolean allowed = rateLimiter.allowRequest();
            if (allowed) {
                System.out.println("Request " + i + ": Allowed");
            } else {
                System.out.println("Request " + i + ": Rate Limited");
            }
        }
    }
}

Leaky Bucket Algorithm

The leaky bucket algorithm operates on the principle of a bucket that fills with incoming requests and leaks them out at a predetermined constant rate.

Here’s how the leaky bucket algorithm works:

  1. Bucket Initialization: Imagine a bucket with a fixed capacity. The bucket initially starts empty.
  2. Incoming Data: As data packets or traffic arrive, they are added to the bucket. Each packet or unit of data represents a token in the context of the algorithm.
  3. Token Leaking: At a constant rate, tokens (representing data packets) are “leaked” or removed from the bucket. The leaking rate determines the maximum allowed rate of outgoing traffic.
  4. Rate Limiting: If the bucket is full, any additional incoming packets are either dropped or delayed until there is space in the bucket. This ensures that the outgoing rate does not exceed the predetermined rate.

In a queue-based implementation, the algorithm works as follows:

  • When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue.
  • Otherwise, the request is dropped.
  • Requests are pulled from the queue and processed at regular intervals (see the sketch below).
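
A minimal sketch of this queue-based variant is shown below; it uses a bounded queue plus a scheduled drain thread, with illustrative capacity and drain-rate values. Note that the LeakyBucketRateLimiter class that follows implements the simpler counter-based ("leaky bucket as a meter") variant rather than an actual queue.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LeakyBucketQueueLimiter {
    private final BlockingQueue<Runnable> bucket;

    public LeakyBucketQueueLimiter(int capacity, int drainPerSecond) {
        this.bucket = new ArrayBlockingQueue<>(capacity);
        ScheduledExecutorService drainer = Executors.newSingleThreadScheduledExecutor();
        long periodMillis = 1000L / drainPerSecond;
        // Pull one request off the queue at a fixed interval, the constant "leak" rate.
        drainer.scheduleAtFixedRate(() -> {
            Runnable request = bucket.poll();
            if (request != null) {
                request.run();                            // process the request
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Returns false (request dropped) when the bucket is already full.
    public boolean submit(Runnable request) {
        return bucket.offer(request);
    }
}
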
Leaky bucket rate limiter

import java.time.Instant;

public class LeakyBucketRateLimiter {
    private int capacity;
    private int tokens;
    private long lastLeakTime;
    private int leakRate;

    public LeakyBucketRateLimiter(int capacity, int leakRate) {
        this.capacity = capacity;
        this.leakRate = leakRate;
        this.tokens = 0;
        this.lastLeakTime = Instant.now().getEpochSecond();
    }

    // Counter-based ("meter") variant: accept the request only if the bucket still has room.
    public synchronized boolean allowRequest() {
        leakTokens();

        if (tokens < capacity) {
            tokens++;
            return true;
        }

        return false;
    }

    // Drain tokens at the constant leak rate, based on the time elapsed since the last leak.
    private void leakTokens() {
        long now = Instant.now().getEpochSecond();
        long elapsedTime = now - lastLeakTime;
        int tokensToLeak = (int) (elapsedTime * leakRate);

        if (tokensToLeak > 0) {
            tokens = Math.max(tokens - tokensToLeak, 0);
            lastLeakTime = now;
        }
    }

    public static void main(String[] args) {
        // Example usage
        LeakyBucketRateLimiter rateLimiter = new LeakyBucketRateLimiter(10, 2);

        // Simulating requests
        for (int i = 1; i <= 15; i++) {
            boolean allowed = rateLimiter.allowRequest();
            if (allowed) {
                System.out.println("Request " + i + ": Allowed");
            } else {
                System.out.println("Request " + i + ": Rate Limited");
            }
        }
    }
}

Fixed Window Counter

Here’s how the fixed window counter algorithm works:

  • The algorithm divides the timeline into fixed-sized time windows and assigns a counter to each window.
  • Each request increments the counter by one.
  • Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts.

The fixed window counter algorithm provides a straightforward way to enforce rate limits within fixed time intervals. However, it may result in burstiness, as a high number of requests may be allowed at the beginning of each time window, potentially causing uneven distribution or spikes in traffic.

To address burstiness, other rate limiting algorithms like token bucket or leaky bucket can be used, as they offer more flexible and dynamic control over the rate of requests. Nonetheless, the fixed window counter algorithm serves as a fundamental concept in rate limiting and can be suitable for certain scenarios where burstiness is acceptable or not a significant concern.

import java.time.Instant;
import java.util.concurrent.TimeUnit;

public class FixedWindowRateLimiter {
    private int rateLimit;
    private int windowSizeInSeconds;
    private int requestCount;
    private long windowStartTime;

    public FixedWindowRateLimiter(int rateLimit, int windowSizeInSeconds) {
        this.rateLimit = rateLimit;
        this.windowSizeInSeconds = windowSizeInSeconds;
        this.requestCount = 0;
        this.windowStartTime = Instant.now().getEpochSecond();
    }

    // Accept the request if the counter for the current window is still below the limit.
    public synchronized boolean allowRequest() {
        resetIfWindowExpired();

        if (requestCount < rateLimit) {
            requestCount++;
            return true;
        }

        return false;
    }

    // Start a new window and reset the counter once the current window has elapsed.
    private void resetIfWindowExpired() {
        long currentTime = Instant.now().getEpochSecond();
        long elapsedTime = currentTime - windowStartTime;

        if (elapsedTime >= windowSizeInSeconds) {
            requestCount = 0;
            windowStartTime = currentTime;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Example usage
        FixedWindowRateLimiter rateLimiter = new FixedWindowRateLimiter(10, 60);

        // Simulating requests
        for (int i = 1; i <= 15; i++) {
            boolean allowed = rateLimiter.allowRequest();
            if (allowed) {
                System.out.println("Request " + i + ": Allowed");
            } else {
                System.out.println("Request " + i + ": Rate Limited");
            }

            TimeUnit.SECONDS.sleep(1); // Simulating request rate
        }
    }
}

Sliding Window Logs

The fixed window counter algorithm has a limitation at window boundaries: a client can send a burst of requests at the end of one window and another burst at the start of the next, briefly passing up to twice the allowed rate. To overcome this issue, the sliding window log algorithm provides a solution. Here’s how it works:

  • The algorithm maintains a log of request timestamps, often stored in a cache such as Redis sorted sets (see the sketch below).
  • When a new request arrives, outdated timestamps older than the start of the current time window are removed from the log.
  • The timestamp of the new request is added to the log.
  • If the size of the log is equal to or smaller than the allowed count, the request is accepted. Otherwise, it is rejected.

By dynamically managing the log of timestamps, the sliding window log algorithm ensures a consistent rate limit without the inconsistencies of the fixed window counter approach.
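
Since the log is often kept in Redis sorted sets, here is a rough sketch using the Jedis client: timestamps are scored by arrival time, expired entries are trimmed, and the remaining cardinality is compared against the limit. The key naming scheme and the Jedis wiring are assumptions, and in production the three calls would typically be wrapped in a Lua script or MULTI/EXEC block so they execute atomically.

import redis.clients.jedis.Jedis;

public class RedisSlidingWindowLogLimiter {
    private final Jedis jedis;
    private final int rateLimit;
    private final long windowMillis;

    public RedisSlidingWindowLogLimiter(Jedis jedis, int rateLimit, long windowMillis) {
        this.jedis = jedis;
        this.rateLimit = rateLimit;
        this.windowMillis = windowMillis;
    }

    public boolean allowRequest(String clientId) {
        String key = "rate:" + clientId;                  // hypothetical key naming scheme
        long now = System.currentTimeMillis();

        // Remove timestamps that fall outside the current window.
        jedis.zremrangeByScore(key, 0, now - windowMillis);

        // Count the requests still inside the window.
        if (jedis.zcard(key) < rateLimit) {
            // Record this request; the member must be unique, so append a nano-time suffix.
            jedis.zadd(key, now, now + "-" + System.nanoTime());
            return true;
        }
        return false;
    }
}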

Here’s how the sliding window logs algorithm works:

  1. Define a Time Window: Determine the desired duration of the sliding window. For example, a sliding window of 10 minutes will consider only log events from the past 10 minutes.
  2. Create a Data Structure: Set up a suitable data structure, such as a circular buffer or a queue, to store log events within the sliding window. The size of the data structure should accommodate the maximum number of log events that can occur within the time window.
  3. Log Event Processing: As new log events arrive, add them to the data structure, pushing out older events if necessary to maintain the fixed time window size. This ensures that the data structure only contains the most recent events within the sliding window.
  4. Analysis and Filtering: Perform analysis or filtering operations on the log events within the sliding window. For example, you can compute statistics, generate real-time alerts, or extract relevant information based on specific criteria.
  5. Window Sliding: Periodically slide the window forward by removing older log events that fall outside the current time window. This ensures that the data structure remains up-to-date with the latest log events and allows for continuous processing.

By using the sliding window logs algorithm, you can efficiently handle and analyze log events within a defined time window. This approach is particularly useful for real-time monitoring, anomaly detection, or any scenario where recent log events hold more significance than older ones.

import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

public class SlidingWindowRateLimiter {
    private int rateLimit;
    private int windowSizeInSeconds;
    private Queue<Long> requestQueue;

    public SlidingWindowRateLimiter(int rateLimit, int windowSizeInSeconds) {
        this.rateLimit = rateLimit;
        this.windowSizeInSeconds = windowSizeInSeconds;
        this.requestQueue = new ArrayDeque<>();
    }

    // Track request timestamps and allow a new request only while the count inside the window stays below rateLimit.
    public synchronized boolean allowRequest() {
        long currentTime = Instant.now().getEpochSecond();
        long windowStartTime = currentTime - windowSizeInSeconds + 1;

        if (requestQueue.size() < rateLimit) {
            requestQueue.add(currentTime);
            return true;
        } else if (requestQueue.peek() >= windowStartTime) {
            // The oldest logged request is still inside the window, so the limit has been reached.
            return false;
        } else {
            // The oldest entry has expired; evict it and record the new request.
            requestQueue.poll();
            requestQueue.add(currentTime);
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Example usage
        SlidingWindowRateLimiter rateLimiter = new SlidingWindowRateLimiter(10, 60);

        // Simulating requests
        for (int i = 1; i <= 15; i++) {
            boolean allowed = rateLimiter.allowRequest();
            if (allowed) {
                System.out.println("Request " + i + ": Allowed");
            } else {
                System.out.println("Request " + i + ": Rate Limited");
            }

            TimeUnit.SECONDS.sleep(1); // Simulating request rate
        }
    }
}

Key Design Considerations

  • Use a client-side cache to avoid making frequent API calls.
  • Understand the limit and do not send too many requests in a short time frame.
  • Include code to catch exceptions or errors so your client can gracefully recover from exceptions.
  • Add sufficient back-off time to your retry logic (a sketch follows below).
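
Here is a minimal client-side sketch of that retry logic using Java's built-in HttpClient (Java 11+). The endpoint URL, attempt count, and back-off values are illustrative; the client honors the Retry-After header when the server provides one (assuming a plain seconds value) and otherwise backs off exponentially.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RetryWithBackoff {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("https://api.example.com/resource")) // hypothetical endpoint
                .GET()
                .build();

        long backoffMillis = 500;                          // initial back-off
        for (int attempt = 1; attempt <= 5; attempt++) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                System.out.println("Succeeded on attempt " + attempt);
                return;
            }
            // Honor the server's Retry-After header when present, otherwise back off exponentially.
            long waitMillis = response.headers()
                    .firstValue("Retry-After")
                    .map(s -> Long.parseLong(s) * 1000)
                    .orElse(backoffMillis);
            Thread.sleep(waitMillis);
            backoffMillis *= 2;
        }
        System.out.println("Giving up after repeated rate limiting");
    }
}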

Rate Limiter Providers

Here are a few popular rate limiter service providers that offer rate limiting solutions:

  • Cloudflare Rate Limiting: Cloudflare offers a comprehensive suite of security and performance solutions, including rate limiting capabilities. Their Rate Limiting feature allows you to set up custom rate limits for your APIs, websites, or applications. It integrates seamlessly with other Cloudflare services like CDN, DDoS protection, and WAF (Web Application Firewall).
  • Akamai API Gateway: Akamai’s API Gateway provides robust rate limiting capabilities to protect your APIs from abuse and ensure fair usage. It offers fine-grained control over rate limits, allowing you to set limits based on various parameters like IP addresses, API keys, or user roles. Akamai’s solution also includes caching, authentication, and analytics features.
  • AWS API Gateway: Amazon Web Services (AWS) API Gateway allows you to set up rate limiting for your APIs using its built-in functionality. With API Gateway, you can define rate limits based on the number of requests per second, minute, hour, or day. It integrates well with other AWS services and provides scalability and high availability.
  • Google Cloud Endpoints: Google Cloud Endpoints provides rate limiting as part of its API management platform. It enables you to define and enforce rate limits for your APIs, protecting them from excessive usage. Cloud Endpoints offers flexibility in setting rate limits based on various criteria, such as API keys, OAuth scopes, or user identities.
  • Kong: Kong is an open-source API gateway that offers rate limiting capabilities. It allows you to configure rate limits based on the number of requests, requests per minute, or requests per hour. Kong can be deployed on-premises or in the cloud and provides additional features like authentication, logging, and analytics.
  • Redis Cell: Redis Cell is a rate-limiting module for Redis, an in-memory data store. It provides a simple and efficient way to implement rate limiting in your applications and can be integrated into your existing Redis infrastructure.