
Highlights at a glance:
  Time-to-First-Token (TTFT): ↓ up to 88%
  Generation throughput: ↑ up to 4×
  Integrated methods: CacheBlend, EPIC (xKV-ready)
  Token budgeting: dynamic top ~16% by L2 deviation
  Evaluation scope: 10-query RAG case study
  • Built a unified framework that combines recomputation, compression, and eviction for KV caches in RAG workflows, cutting TTFT by up to 88% with negligible accuracy impact
  • Implemented CacheBlend and EPIC inside the vLLM stack; authored modular PyTorch recomputation layers that dynamically select the top ~16% of tokens by L2 deviation over K/V tensors
  • Created an automated design-space exploration pipeline (pandas/NumPy) comparing 10+ token-selection variants (K vs V, early vs deep layers, ranked vs mid-ranked), achieving up to 4× generation speedup
  • Developed diagnostic tooling for token role categorization (intrinsic / relational / dummy) and analyzed layer-wise consistency to inform xKV (SVD) compression and future adaptive cache policies
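
The token-budgeting idea above can be sketched as follows. This is a minimal NumPy illustration (not the project's actual PyTorch implementation): given cached K/V vectors and their freshly recomputed counterparts, rank tokens by per-token L2 deviation and keep only the top ~16% for recomputation. The function name, array shapes, and `budget` parameter are assumptions for illustration.

```python
import numpy as np

def select_tokens_by_l2_deviation(kv_old, kv_new, budget=0.16):
    """Pick indices of the tokens whose K/V vectors deviate most
    (L2 norm) between the cached and recomputed tensors.

    kv_old, kv_new: (num_tokens, hidden_dim) arrays.
    budget: fraction of tokens to recompute (top ~16% here).
    Returns (selected_indices, per_token_deviation).
    """
    # Per-token L2 deviation between cached and recomputed vectors
    deviation = np.linalg.norm(kv_new - kv_old, axis=-1)
    # Token budget: at least one token, ceil of the requested fraction
    k = max(1, int(np.ceil(budget * kv_old.shape[0])))
    # Indices of the k largest deviations, in descending order
    selected = np.argsort(deviation)[::-1][:k]
    return selected, deviation

# Toy usage: 100 cached tokens, 5 of which drift after a context change
rng = np.random.default_rng(0)
kv_old = rng.normal(size=(100, 8))
kv_new = kv_old.copy()
kv_new[:5] += 10.0  # these five tokens now deviate strongly
top, dev = select_tokens_by_l2_deviation(kv_old, kv_new)
```

A real CacheBlend-style pipeline would apply this per layer and recompute only the selected tokens' K/V entries, which is where the TTFT savings come from.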