Distributed Cache System

Overview

A production-grade distributed caching system built from scratch to understand the internals of systems like Redis Cluster and Memcached. The system supports key-value storage with TTL, consistent hashing for data distribution, leader-based replication for fault tolerance, and multiple eviction policies.

Problem

Single-node caches become a bottleneck and single point of failure at scale. This project explores the engineering challenges of distributing cache state across multiple nodes while maintaining low latency, high availability, and consistency guarantees.

Architecture

The system consists of three main components:

Cache Nodes: Each node runs an in-memory key-value store with configurable eviction (LRU, LFU, TTL-based)
Hash Ring: Consistent hashing with virtual nodes distributes keys across cache nodes, minimizing redistribution when nodes join or leave
Replication Manager: Raft-based consensus for leader election and log replication ensures data survives node failures
Client Library: A Go client that handles node discovery, request routing, and automatic failover

Tradeoffs

CP over AP: Chose consistency over availability — writes are acknowledged only after replication to a quorum, which adds latency but prevents stale reads
gRPC over HTTP: Binary protocol reduces serialization overhead for the high-frequency cache operations
Memory-only over disk-backed: No persistence to disk keeps latency predictable but means a full cluster restart loses all data

Implementation

Key engineering decisions:

Virtual nodes: Each physical node maps to 150 virtual nodes on the hash ring, ensuring even key distribution even with heterogeneous hardware
Gossip protocol: Nodes exchange health information via a gossip protocol, enabling fast failure detection without a centralized coordinator
Batch operations: Multi-get and multi-set operations are split by destination node and executed in parallel
Backpressure: Connection pooling with circuit breakers prevents cascade failures when a node becomes slow

Challenges

Handling split-brain scenarios during network partitions required careful Raft configuration and partition-aware client routing
Achieving sub-millisecond p99 latency while maintaining replication guarantees
Hot key detection and mitigation — popular keys are replicated to additional nodes with read-your-writes consistency
Memory fragmentation management in long-running Go processes

Future Improvements

Redis-compatible protocol support for drop-in replacement
Tiered storage with SSD backing for warm data
Cluster auto-scaling based on memory pressure and hit rate metrics
Cross-datacenter replication for geo-distributed deployments