System Designintermediate

Designing a Rate Limiter

A system design deep dive into building a distributed rate limiter, exploring token bucket, sliding window, and fixed window algorithms with Redis-backed implementations.

April 20, 20244 min read
system-designrate-limitingredisdistributed-systemsapi-design

Problem Statement

Design a rate limiter that controls the number of requests a client can send to an API within a given time window. The rate limiter should be distributed (works across multiple API servers), configurable per endpoint, and add minimal latency to request processing.

Requirements

Functional Requirements

  • Limit requests per client (by API key, IP address, or user ID)
  • Support different rate limits per API endpoint
  • Return appropriate HTTP 429 (Too Many Requests) with retry-after header
  • Support multiple rate limiting algorithms
  • Allow rate limit configuration changes without restart

Non-Functional Requirements

  • Latency: Add less than 5ms overhead per request
  • Availability: Rate limiter failure should not block legitimate traffic (fail-open)
  • Accuracy: Minimal over-counting or under-counting of requests
  • Scale: Handle 1M+ requests per second across the fleet

Rate Limiting Algorithms

1. Token Bucket

Tokens are added to a bucket at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Pros: Smooth rate limiting, allows short bursts Cons: Two parameters to tune (bucket size, refill rate)

2. Fixed Window Counter

Count requests in fixed time windows (e.g., per minute). Reset count at the start of each window.

Pros: Simple to implement, memory efficient Cons: Burst at window boundaries — a client can send 2x the limit by timing requests at the end of one window and start of the next

3. Sliding Window Log

Store the timestamp of each request. Count requests within the sliding window by removing expired timestamps.

Pros: Accurate, no boundary issues Cons: Memory intensive — stores every request timestamp

4. Sliding Window Counter

Hybrid approach: combine the current window count with a weighted portion of the previous window count.

Pros: Memory efficient, reasonably accurate Cons: Approximate — not perfectly accurate but within acceptable bounds

Architecture

Components

  1. Rate Limiter Middleware: Intercepts every API request before it reaches the application
  2. Rules Engine: Stores rate limit configurations per endpoint and client tier
  3. Counter Store: Redis cluster storing request counts
  4. Monitoring: Tracks rate limit hits, false positives, and latency overhead

Request Flow

  1. Request arrives at API gateway
  2. Rate limiter middleware extracts client identifier (API key)
  3. Looks up rate limit rules for the endpoint + client tier
  4. Queries Redis for current request count
  5. If under limit: increment counter, forward request
  6. If over limit: return 429 with Retry-After header

Database Choice

Redis is the ideal choice for rate limiting:

  • Speed: Sub-millisecond operations for increment and get
  • Atomic operations: INCR with EXPIRE provides atomic counter increment with TTL
  • Distributed: Redis Cluster supports the distributed counter pattern
  • Lua scripting: Complex rate limiting logic can run atomically on the server side

Redis Implementation (Token Bucket)

-- Lua script for atomic token bucket rate limiting
local key = KEYS[1]
local max_tokens = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_tokens
local last_refill = tonumber(data[2]) or now

-- Calculate tokens to add based on elapsed time
local elapsed = now - last_refill
local new_tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))

if new_tokens >= 1 then
    redis.call('HMSET', key, 'tokens', new_tokens - 1, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) * 2)
    return 1  -- Request allowed
else
    return 0  -- Request denied
end

Scaling Strategies

Local + Global Rate Limiting

Use a two-tier approach:

  • Local: Each API server maintains an in-memory counter (fast, no network hop)
  • Global: Periodically sync with Redis for cross-server accuracy
  • Tradeoff: Slightly less accurate but significantly lower latency

Redis Cluster Partitioning

Partition rate limit keys across Redis nodes by client ID hash. This distributes the load and prevents any single Redis node from becoming a bottleneck.

Failure Handling

Redis Unavailable

Fail open: Allow all requests through. It's better to temporarily allow extra traffic than to block all users. Log the failure for alerting.

Clock Skew

Use Redis server time (TIME command) instead of application server time to avoid issues with clock skew across distributed servers.

Race Conditions

The Lua script approach eliminates race conditions by executing the check-and-increment atomically on the Redis server.

Tradeoffs

  1. Accuracy vs. Latency: Local counters are faster but less accurate; centralized Redis is accurate but adds network latency
  2. Memory vs. Precision: Sliding window log is perfectly accurate but uses O(n) memory per client; fixed window uses O(1) but has boundary issues
  3. Fail-open vs. Fail-closed: Fail-open risks abuse during outages; fail-closed risks blocking legitimate traffic
  4. Per-server vs. Global limits: Per-server limits are simpler but require knowing the fleet size; global limits handle auto-scaling correctly