EVOKE
OS-like memory management for the LLM KV cache.
Long-running LLM agent sessions outgrow the physical KV cache budget within a few turns. EVOKE evicts low-relevance blocks under budget pressure and recovers them recompute-free via a custom save/restore primitive in a forked llama.cpp: 20–32× faster than re-prefilling the same tokens.
Demos

A 14-turn session with a 1024-token budget. A fact is planted at turn 1 (favorite number = 4242), 12 unrelated knowledge questions fill the session, and at turn 14 the fact is probed. The session survives 40 evictions and 13 recoveries, and the model recalls “4242”.

Same demo on a hybrid Mamba/Attention architecture. The model emits a <think>...</think> trace each turn. With EVOKE_SUPPRESS_THINKING_STRIP=1 the server keeps the thinking trace in the returned content so the cached state stays aligned with what the client echoes back, and no session resets fire. 26 evictions, 4 recoveries, fact recalled.
What it actually is
- Two new C++ primitives in a forked llama.cpp: llama_kv_block_save and llama_kv_block_load. They serialise a position range’s K/V tensors to a host buffer and splice them back with per-cell RoPE re-anchoring, with no llama_decode call.
- A third C primitive llama_attn_capture_* that taps per-head softmax attention weights from one or more chosen layers (up to 16) into a host buffer once per decode. Used by the relevance scorer to learn what the model is actually attending to.
- A Python policy layer (evoke/manager.py, evoke/scorer.py, evoke/attention_scorer.py) that drives eviction under a watermark policy via a multi-signal scorer (model attention + harness priority + task-focus coherence + recency) and routes recovery through three pluggable backends: discard, breadcrumb, or kv_restore (the recompute-free splice).
- An OpenAI-compatible chat-completions server that exposes EVOKE as a stateful endpoint. The persistent KV cache survives across requests; only the new tail of each prompt is decoded.
- Cross-architecture coverage: pure attention with standard RoPE (Qwen 2.5, Llama 3), hybrid Mamba/Attention (Qwen 3.5), MoE attention with mrope and thinking mode (Qwen 3.6 35B-A3B).
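The watermark-driven eviction mentioned in the policy-layer bullet can be pictured with a minimal sketch. This is not the code in evoke/manager.py: the `Block` type, the `select_evictions` name, and the high/low watermark fractions (0.9 and 0.7) are hypothetical, chosen only to illustrate one common reading of a watermark policy, where eviction starts once usage crosses the high mark and continues, lowest score first, until usage falls to the low mark.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    n_tokens: int
    score: float          # relevance in [0, 1], higher means keep
    pinned: bool = False  # pinned blocks are never considered for eviction


def select_evictions(blocks, budget, high=0.9, low=0.7):
    """Return ids of blocks to evict, lowest score first.

    `high` and `low` are assumed watermark fractions of `budget`,
    not values taken from EVOKE.
    """
    active = sum(b.n_tokens for b in blocks)
    if active <= high * budget:
        return []  # no budget pressure yet
    victims = []
    candidates = sorted((b for b in blocks if not b.pinned), key=lambda b: b.score)
    for b in candidates:
        if active <= low * budget:
            break
        victims.append(b.block_id)
        active -= b.n_tokens
    return victims
```

With a 1024-token budget and 1100 active tokens, the sketch evicts the lowest-scoring unpinned blocks until usage is at or below roughly 717 tokens. Using two marks rather than one avoids evicting a single block on every decode step.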
How does the system know what’s relevant?
Four signals, combined into a per-block score in [0, 1]. Lowest scores get evicted first when the cache exceeds budget.
- The model’s own attention. A second softmax for one or more chosen transformer layers runs alongside the main attention path, writing per-head softmax weights to a host buffer once per decode. The scorer maintains a sliding window of recent attention mass per block (last 64 decode steps, EWMA decay 0.95). Blocks the model is actually attending to score high. This is the strongest single signal — the truest answer to “what’s relevant right now.”
- Harness-supplied priority tags. A coding harness like opencode or Claude Code can set evoke_priority (a float multiplier) and evoke_pinned (boolean, excluded from eviction entirely) on each chat request. Useful when the harness knows things the model can’t see: a file read is the central artifact of the current task; a tool scratch output is one-shot. Defaults to 1.0 / false.
- Task-focus coherence. The scorer tracks a single task-focus embedding that updates via EMA on new user messages but snaps to the new message when a topic shift is detected (cosine drop below 0.3) or signaled by the harness via evoke_task_boundary=true. Blocks from a prior task lose their coherence score in one pass instead of decaying over five turns.
- Recency, sink protection, source-type floors. Stability priors: prevent thrashing on a single attention spike; protect StreamingLLM-style sink tokens; give USER and ASSISTANT turns a floor so the conversation backbone isn’t evicted before document content.
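The task-focus update described above can be sketched as follows. This is an illustration under assumptions, not the code in evoke/scorer.py: the function names and the EMA rate `alpha` are hypothetical, and "cosine drop below 0.3" is read here as "cosine similarity between the current focus and the new message falls below 0.3".

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def update_focus(focus, msg, alpha=0.3, shift_threshold=0.3, task_boundary=False):
    """Return the new task-focus embedding after a user message.

    Snaps to `msg` on a detected topic shift or an explicit boundary flag;
    otherwise blends with an exponential moving average.
    `alpha` is an assumed rate, not a value from EVOKE.
    """
    if focus is None or task_boundary or cosine(focus, msg) < shift_threshold:
        return list(msg)  # snap: prior-task blocks lose coherence in one pass
    return [(1 - alpha) * f + alpha * m for f, m in zip(focus, msg)]
```

The snap is what lets blocks from a previous task drop out quickly: with a pure EMA, the old focus would still dominate for several turns after the topic changed.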
Final score: min(priority * (w_attn·attn + w_rec·recency + w_coh·coherence) / Σw, 1.0) lifted by a source-type floor (USER blocks 0.6, ASSISTANT blocks 0.5 by default) and with pinned-block protection.
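A minimal sketch of that formula, for concreteness. The floors (USER 0.6, ASSISTANT 0.5) and the default priority of 1.0 come from the text above; the weight values, the `block_score` name, and the choice to return 1.0 for pinned blocks are assumptions made for illustration (the real scorer may exclude pinned blocks from ranking without scoring them at all).

```python
# Source-type floors stated in the text; other source types get no floor here.
FLOORS = {"USER": 0.6, "ASSISTANT": 0.5}


def block_score(attn, recency, coherence, priority=1.0, source="DOCUMENT",
                pinned=False, w_attn=0.5, w_rec=0.2, w_coh=0.3):
    """Combine the signals into a per-block score in [0, 1].

    All inputs are expected in [0, 1]. The weights are hypothetical defaults.
    """
    if pinned:
        return 1.0  # assumption: model "never evict" as the maximum score
    total = w_attn + w_rec + w_coh
    blended = (w_attn * attn + w_rec * recency + w_coh * coherence) / total
    score = min(priority * blended, 1.0)       # priority multiplies, then clamp
    return max(score, FLOORS.get(source, 0.0))  # lift to the source-type floor
```

Because the floor is applied after the clamp, a USER block with no attention, recency, or coherence still scores 0.6, which keeps the conversation backbone above most document blocks.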
Latency
Measured on Qwen 2.5 7B, RTX 4070 Ti SUPER, Flash Attention enabled. kv_block_load is the EVOKE recovery path; re-prefill is the cost of re-encoding the same tokens via llama_decode.
| Block (tokens) | save (ms) | load (ms) | re-prefill (ms) | speedup |
|---|---|---|---|---|
| 20 | 1.10 | 0.48 | 11.90 | 25× |
| 40 | 1.61 | 0.70 | 13.78 | 20× |
| 160 | 4.69 | 1.50 | 32.60 | 22× |
| 640 | 16.37 | 4.34 | 118.36 | 27× |
| 1280 | 31.90 | 7.25 | 232.18 | 32× |
The absolute gap (re-prefill minus load) widens roughly linearly with block size, while the speedup ratio stays in the 20–32× range: re-prefill is O(tokens × model_FLOPs), load is O(tokens × bytes).
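The speedup column can be rechecked from the measured columns. The sketch below only re-derives numbers already in the table (speedup = re-prefill time divided by load time); the function names are illustrative.

```python
# (block tokens, save ms, load ms, re-prefill ms), copied from the table above.
ROWS = [
    (20, 1.10, 0.48, 11.90),
    (40, 1.61, 0.70, 13.78),
    (160, 4.69, 1.50, 32.60),
    (640, 16.37, 4.34, 118.36),
    (1280, 31.90, 7.25, 232.18),
]


def speedups(rows):
    """Speedup of kv_block_load over re-prefill, rounded to the nearest integer."""
    return [(tokens, round(reprefill / load)) for tokens, _save, load, reprefill in rows]


def gaps_ms(rows):
    """Absolute time saved per recovery (re-prefill minus load), in milliseconds."""
    return [(tokens, reprefill - load) for tokens, _save, load, reprefill in rows]
```

Calling `speedups(ROWS)` reproduces the 25×, 20×, 22×, 27×, 32× column, and `gaps_ms(ROWS)` shows the time saved growing from about 11 ms at 20 tokens to about 225 ms at 1280 tokens.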
Verified end-to-end
The mechanism has been run, measured, and stress-tested against a real coding agent.
- Live opencode session against Qwen 3.5 9B (hybrid Mamba/Attention + thinking, budget = 2048). 250 cumulative evictions, 4 smart-recoveries, active_tokens held near 1414 (within budget) while cached_tokens grew to 32 902. The agent’s conversation was 23× larger than what was actually held in GPU at any moment.
- A real bug was caught and fixed during this live integration. _evictable_blocks was over-pinning prompt-decoded DOCUMENT blocks under pin_generated, which silently zeroed evictions on the server path used by external harnesses. Root-cause traced via two reproducer scripts, fix in one targeted condition, 106 unit tests still pass.
- All paper numbers are reproducible from the scripts in scripts/. Raw output for the agentic eval, attention-scorer ablation, and keepalive workload are checked into the repo.
- Three model families verified at the primitive level (Qwen 2.5 7B, Qwen 3.5 9B, Qwen 3.6 35B-A3B). Full server-side evaluation is on Qwen 2.5 7B; cross-architecture latency and scorer ablations are explicit limitations in the paper.
Status
Research prototype, targeting both a working system and a paper draft. The mechanism is verified end-to-end across three model families. Recently closed: tools-aware Jinja chat template (so tool-using turns no longer trigger session resets), multi-session pool with state swap on a custom X-EVOKE-Session header, iSWA dual-cache support in the fork primitives, multi-layer attention capture (up to 16 layers per decode).
License: Apache-2.0 on the policy layer; MIT on the forked llama.cpp work.