| Highlight | Outcome |
|---|---|
| Time-to-First-Token (TTFT) | ↓ up to 88% |
| Generation throughput | ↑ up to 4× |
| Integrated methods | CacheBlend, EPIC (xKV-ready) |
| Token budgeting | Dynamic top ~16% of tokens by K/V L2 deviation |
| Evaluation scope | 10-query RAG case study |
- Built a unified framework that combines recomputation, compression, and eviction for KV caches in RAG workflows, cutting TTFT by up to 88% with negligible accuracy impact
- Implemented CacheBlend & EPIC inside the vLLM stack; authored modular PyTorch recomputation layers that dynamically select the top ~16% of tokens by L2 deviation over K/V tensors
- Created an automated design-space exploration pipeline (pandas/NumPy) comparing 10+ token-selection variants (K vs V, early vs deep layers, ranked vs mid-ranked), achieving up to 4× generation speedup
- Built diagnostic tooling for token role categorization (intrinsic / relational / dummy) and analyzed layer-wise consistency to inform xKV (SVD) compression and future adaptive cache policies
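The deviation-based token budgeting above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, tensor shapes, and the 16% default budget are assumptions drawn from the description, and the score simply sums per-token L2 deviations over the K and V tensors.

```python
import torch

def select_tokens_by_kv_deviation(k_old, k_new, v_old, v_new, budget=0.16):
    """Pick the token positions whose K/V entries drifted most (L2 norm)
    between the reused cache and a fresh recomputation.

    Assumed shapes: (num_tokens, num_heads, head_dim). `budget` is the
    fraction of tokens to recompute (~16% here). Illustrative only.
    """
    # Per-token L2 deviation, flattened over heads and head_dim,
    # summed across K and V
    dk = (k_new - k_old).flatten(1).norm(dim=1)
    dv = (v_new - v_old).flatten(1).norm(dim=1)
    deviation = dk + dv
    n_select = max(1, int(budget * deviation.numel()))
    # Indices of the most-deviating tokens, returned in position order
    top = torch.topk(deviation, n_select).indices
    return torch.sort(top).values
```

In a CacheBlend-style pipeline, only the returned positions would be re-run through attention; the remaining ~84% of cached K/V entries are reused as-is.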
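For the xKV-style compression mentioned above, a low-rank SVD sketch conveys the idea. This is not xKV's actual interface: the function names, the flattened `(num_tokens, hidden)` layout, and the fixed rank are assumptions for illustration.

```python
import torch

def svd_compress_kv(kv, rank=32):
    """Truncated-SVD compression of a KV-cache slice: kv ≈ A @ B,
    where A is (num_tokens, rank) and B is (rank, hidden).
    Illustrative sketch, not xKV's real API.
    """
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    r = min(rank, S.numel())
    # Fold singular values into the left factor; store only A and B
    return U[:, :r] * S[:r], Vh[:r, :]

def svd_decompress_kv(a, b):
    """Reconstruct the approximate KV slice from its two factors."""
    return a @ b
```

Layer-wise consistency of token roles (intrinsic / relational / dummy) determines how aggressively the rank can be cut per layer before accuracy degrades.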