| Highlight | Outcome |
|---|---|
| Time-to-First-Token (TTFT) | ↓ up to 88% |
| Generation throughput | ↑ up to 4× |
| Integrated methods | CacheBlend, EPIC (xKV-ready) |
| Token budgeting | Dynamic top ~16% of tokens by K/V L2 deviation |
| Evaluation scope | 10-query RAG case study |
- Built a unified framework that combines recomputation, compression, and eviction for KV caches in RAG workflows, cutting TTFT by up to 88% with negligible accuracy impact
- Implemented CacheBlend & EPIC inside the vLLM stack; authored modular PyTorch recomputation layers that dynamically select the top ~16% of tokens by L2 deviation over K/V tensors
- Created an automated design-space exploration pipeline (pandas/NumPy) comparing 10+ token-selection variants (K vs V, early vs deep layers, ranked vs mid-ranked), achieving up to 4× generation speedup
- Built diagnostic tooling for token role categorization (intrinsic / relational / dummy) and analyzed layer-wise consistency to inform xKV (SVD) compression and future adaptive cache policies
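The deviation-based token budgeting above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, tensor shapes, and the 16% default budget are assumptions drawn from the description, and the score simply sums per-token L2 deviations over the K and V tensors.

```python
import torch

def select_tokens_by_kv_deviation(k_old, k_new, v_old, v_new, budget=0.16):
    """Pick the token positions whose K/V entries drifted most (L2 norm)
    between the reused cache and a fresh recomputation.

    Assumed shapes: (num_tokens, num_heads, head_dim). `budget` is the
    fraction of tokens to recompute (~16% here). Illustrative only.
    """
    # Per-token L2 deviation, flattened over heads and head_dim,
    # summed across K and V
    dk = (k_new - k_old).flatten(1).norm(dim=1)
    dv = (v_new - v_old).flatten(1).norm(dim=1)
    deviation = dk + dv
    n_select = max(1, int(budget * deviation.numel()))
    # Indices of the most-deviating tokens, returned in position order
    top = torch.topk(deviation, n_select).indices
    return torch.sort(top).values
```

In a CacheBlend-style pipeline, only the returned positions would be re-run through attention; the remaining ~84% of cached K/V entries are reused as-is.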
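For the xKV-style compression mentioned above, a low-rank SVD sketch conveys the idea. This is not xKV's actual interface: the function names, the flattened `(num_tokens, hidden)` layout, and the fixed rank are assumptions for illustration.

```python
import torch

def svd_compress_kv(kv, rank=32):
    """Truncated-SVD compression of a KV-cache slice: kv ≈ A @ B,
    where A is (num_tokens, rank) and B is (rank, hidden).
    Illustrative sketch, not xKV's real API.
    """
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    r = min(rank, S.numel())
    # Fold singular values into the left factor; store only A and B
    return U[:, :r] * S[:r], Vh[:r, :]

def svd_decompress_kv(a, b):
    """Reconstruct the approximate KV slice from its two factors."""
    return a @ b
```

Layer-wise consistency of token roles (intrinsic / relational / dummy) determines how aggressively the rank can be cut per layer before accuracy degrades.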