RadixAttention plus PagedAttention: the UniLLM KV cache, explained
Why UniLLM's KV cache is hybrid, what RadixAttention and PagedAttention each contribute, and the honest state of integration today.
The KV cache is the single most consequential data structure in an LLM inference runtime. Every decoded token reads from it; every prefilled prompt writes into it; how it grows, how it’s reused, and how it’s paged decide whether your runtime is bandwidth-bound, memory-bound, or just slow. UniLLM’s cache lives in crates/kv, and it is deliberately a hybrid: RadixAttention layered on top of PagedAttention, with an adaptive tiering policy that decides which entries belong where.
This post walks through what each half does, why both are present, and what is and is not wired up today.
What we’re caching, and why
A decoder-style transformer reads its inputs, computes a Key and Value tensor at each layer, and uses those K and V tensors to attend over all previous tokens. If you throw the K and V away after each step, you redo that work on every decoded token. If you keep them, you have a per-layer, per-token, per-head pile of state that grows linearly with the sequence and is read by every later step. That pile is the KV cache.
Two practical pressures shape every KV cache design:
- Reuse across requests. Many requests share prefixes — system prompts, chat templates, retrieved documents. If the cache can recognise those prefixes, the runtime can skip the prefill on the shared portion and only do work on the tail.
- Memory fragmentation. Sequences end at unpredictable lengths. A naive contiguous-buffer cache wastes memory on overestimated lengths and fragments badly under concurrent traffic.
PagedAttention addresses pressure #2. RadixAttention addresses pressure #1. UniLLM keeps both, which is the only honest answer once you serve production-shaped traffic.
PagedAttention, briefly
PagedAttention is the technique vLLM made widely known: instead of allocating one contiguous KV buffer per sequence, you allocate fixed-size pages, hand each sequence a list of page indices, and let pages move independently. Adding a token appends to the last page; full pages get a new one; sequences that share content can share pages without copying. The attention kernel learns to walk the page table.
The benefit is concrete: memory utilisation climbs because you stop over-allocating, and concurrent sequences stop fighting over contiguous arena. The cost is a layer of indirection in the attention kernel and a page table per request.
UniLLM’s KV crate implements this layout. The attention operator on the CPU backend reads pages through that page table; when the GPU backends come online, the same page table is what their kernels will read.
RadixAttention, briefly
RadixAttention is the prefix-cache idea: organise cached KV entries by their token-prefix identity in a radix tree. When a new request arrives, walk the tree with its prompt tokens. The longest matching path is a prefix you’ve already prefilled; you reuse those KV pages and only run the model on the suffix.
For workloads with shared system prompts, repeated retrieval documents, or chat sessions where the conversation grows by one user turn at a time, RadixAttention is the difference between O(prompt length) and O(new tokens) work per request. The radix tree lets the cache find that prefix in time proportional to the matched length, not the cache size.
The hybrid: tiering between them
The two techniques are complementary, not competing. PagedAttention is a memory layout. RadixAttention is an indexing strategy. UniLLM treats them as two layers of one cache:
- Storage: every K and V tensor for every layer is stored in PagedAttention pages. Allocation, free, and the attention kernel’s read path all go through the page table.
- Index: the RadixAttention tree maps prefix-token-sequences to lists of page indices. A cache hit returns a list of pages that can be re-attached to a new request’s page table without copying any K/V data.
What makes the cache adaptive is the tiering policy. Not every entry deserves to stay in the radix index. A one-shot, never-repeated prompt benefits from the paged layout but should not pollute the prefix tree. A system prompt that repeats a thousand times an hour should pin itself in the tree as long as memory allows. The policy decides which entries get promoted into the radix index, which get demoted out, and which pages can be reclaimed when memory pressure rises.
Where it sits in the runtime
The KV crate lives next to a scheduler crate that does the actual placement of work: continuous batching (mixing prefill and decode steps in the same batch), chunked prefill (splitting long prompts across batches so they don’t starve other requests), and admission control (refusing new requests when the cache and compute are saturated). The scheduler is the customer of the KV cache. It asks for pages, asks for prefix lookups, and gets back the page lists it needs to construct attention inputs.
Both crates ship with their tests. The README’s headline number — “201 tests pass across all workspace crates” — covers them. The cache has correct mechanics today.
What is not wired up yet
This is where UniLLM is deliberately honest, and where the ROADMAP is the document to read. Two lines from “What doesn’t work yet”:
KV cache integration with inference — The KV cache system is implemented and tested but not yet wired into the generation loop for token reuse.
GPU acceleration — All inference runs on CPU.
In other words: the KV cache crate is complete on its own terms. The autoregressive generation loop does not yet consume it. A generated token today does not pull K and V from the paged store and reuse them on the next step; it recomputes. Wiring the existing KVCache / LayerKVCache types into the generation loop is the third item on the near-term priority list, after broader model validation and GPU backends.
That is a strange place to be — a fully built cache that the runtime isn’t yet calling — but it’s the right place to be once. The cache is the place where correctness is hardest: page tables, free lists, prefix matching, and concurrent eviction all have to be right before the generation loop can trust them. UniLLM built the cache first, with tests, and is now connecting the loop.
What this means in practice
If you read the KV crate today, you’ll find:
- The page table abstraction and the paged attention kernel for CPU.
- The radix tree, the prefix-match path, and the entry-promotion / demotion logic.
- The adaptive tiering policy that decides what lives in the index.
- Tests that exercise allocation, eviction, prefix reuse, and concurrent access.
What you won’t find — yet — is the bridge from model.generate(...) into that cache. That bridge is small relative to the cache itself. It’s also the next step.
If you’re evaluating UniLLM for production, this is the right thing to track in the issue tracker: the moment the generation loop becomes cache-aware, the runtime moves from “compute every token from scratch” to “reuse what you already computed,” and the published per-token cost falls off a cliff. The ROADMAP names this work explicitly so it’s easy to watch.
Why hybrid is the conservative choice
It’s tempting to treat “hybrid” as a hedge: pick both designs because we couldn’t decide. The honest read is the opposite. PagedAttention solves a memory-layout problem; RadixAttention solves an index problem. The two answers do not compete because they’re not answering the same question. A pure paged cache without an index will waste prefill cycles on shared prompts. A pure radix cache without paging will fragment memory under concurrent traffic. Once you serve real workloads, both pressures show up. Building the cache hybrid from the start is the conservative answer, not the maximalist one.
The adaptive tiering policy is what keeps the hybrid from over-investing in one half. If a workload exhibits no prefix reuse, the radix tree stays small and pages flow through PagedAttention with normal eviction. If a workload is heavily prefix-shared, the radix tree fills up and the cache returns hits on most requests. The policy decides where each entry sits, not the runtime user.
What you can rely on, what you should watch
For a team evaluating UniLLM against the KV cache in mind, the picture is straightforward:
- You can rely on the cache mechanics: pages, page tables, radix lookups, eviction, adaptive tiering. They are implemented and covered by tests.
- You should watch for two things: GPU backends landing in
TensorCore(which is what makes the cache useful at production scale), and the generation loop becoming cache-aware (item #3 on the near-term roadmap, after model validation and GPU work).
For the long-form architecture, see docs/ARCHITECTURE.md in the repo. For the current state of every subsystem, see docs/ROADMAP.md.
Source: github.com/cognisoc/unillm · Docs: docs.cognisoc.com/unillm/