About UniLLM
UniLLM is a modular LLM inference runtime written in Rust. It exists to make running large language models — on whatever device, in whatever weight format, with whatever architecture — a problem of writing one well-typed forward pass instead of an engineering project.
What the runtime is
UniLLM is organised as a small Cargo workspace. The crates split along clear lines so each can change without dragging the others along:
- crates/runtime: the core inference runtime (tensor ops, the Model trait, weight loading, and the 47 architecture implementations).
- crates/inference: the high-level inference engine and batching.
- crates/kv: the hybrid KV cache (RadixAttention + PagedAttention).
- crates/scheduler: request scheduling with continuous batching and chunked prefill.
The three core abstractions
Almost everything in UniLLM is built out of three traits and the types they carry. They are deliberately small.
TensorCore
A single Tensor struct backed by Arc<dyn TensorStorage>, with a
Device enum covering CPU, CUDA(usize), and
Metal(usize). Operations go through a TensorOps trait and a
functional ops_fn module so model code reads as plain Rust functions, not method
chains on opaque types.
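To make the shape of this abstraction concrete, here is a minimal sketch of a tensor backed by Arc<dyn TensorStorage> with a free-function op on top. Only the names Tensor, TensorStorage, and Device (with its three variants) come from the description above; the trait methods, the CpuStorage type, from_vec, to_vec, and the add function are assumptions made for illustration and are not UniLLM's actual API.

```rust
use std::sync::Arc;

/// Where a tensor's data lives. The variants follow the text above;
/// the `usize` payload is taken to be a device index (an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Cuda(usize),
    Metal(usize),
}

/// Hypothetical storage interface: the real trait's methods are not
/// shown in the text, so these two are illustrative only.
pub trait TensorStorage: Send + Sync {
    fn device(&self) -> Device;
    fn as_f32_slice(&self) -> &[f32];
}

/// A CPU-only storage used for this sketch.
pub struct CpuStorage(Vec<f32>);

impl TensorStorage for CpuStorage {
    fn device(&self) -> Device {
        Device::Cpu
    }
    fn as_f32_slice(&self) -> &[f32] {
        &self.0
    }
}

/// One tensor type for every backend; cloning shares the storage.
#[derive(Clone)]
pub struct Tensor {
    storage: Arc<dyn TensorStorage>,
    shape: Vec<usize>,
}

impl Tensor {
    /// Builds a CPU tensor; panics if `shape` does not match the data length.
    pub fn from_vec(data: Vec<f32>, shape: Vec<usize>) -> Self {
        assert_eq!(data.len(), shape.iter().product::<usize>());
        Tensor { storage: Arc::new(CpuStorage(data)), shape }
    }
    pub fn device(&self) -> Device {
        self.storage.device()
    }
    pub fn shape(&self) -> &[usize] {
        &self.shape
    }
    pub fn to_vec(&self) -> Vec<f32> {
        self.storage.as_f32_slice().to_vec()
    }
}

/// A functional-style op, in the spirit of the `ops_fn` module:
/// element-wise addition of two tensors with identical shapes.
pub fn add(a: &Tensor, b: &Tensor) -> Tensor {
    assert_eq!(a.shape(), b.shape(), "shapes must match");
    let data: Vec<f32> = a
        .to_vec()
        .iter()
        .zip(b.to_vec().iter())
        .map(|(x, y)| x + y)
        .collect();
    Tensor::from_vec(data, a.shape().to_vec())
}
```

With this shape, model code can read as add(&hidden, &residual) rather than a chain of methods on a backend-specific type, which is the readability point the section makes.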
ModelCore
One Model trait with new, from_weights,
forward, generate, and to_device. Inputs and outputs
are typed enums — ModelInputs::Text, ::Image,
::Multimodal, ::Audio and ModelOutputs::Logits,
::Embeddings, ::Multimodal — so the type system tells you
which adapter you need for which use case.
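The following sketch illustrates how typed input and output enums let the compiler enforce which modality a model accepts. The variant names come from the description above; the payload types, the ModelError type, the method signatures, and the toy SuccessorModel are all assumptions for illustration. Only forward and generate are sketched here; new, from_weights, and to_device are left out.

```rust
/// Variant names follow the text; the payloads are placeholders.
#[derive(Debug, Clone, PartialEq)]
pub enum ModelInputs {
    Text(Vec<u32>),   // token ids
    Image(Vec<f32>),  // flattened pixels
    Audio(Vec<f32>),  // raw samples
    Multimodal { tokens: Vec<u32>, pixels: Vec<f32> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum ModelOutputs {
    Logits(Vec<f32>),
    Embeddings(Vec<f32>),
    Multimodal { logits: Vec<f32>, embeddings: Vec<f32> },
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ModelError {
    Unsupported(&'static str),
}

pub trait Model {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, ModelError>;

    /// Greedy generation layered on `forward` (a default method here;
    /// how the real trait implements `generate` is not stated).
    fn generate(&self, prompt: &[u32], max_new_tokens: usize) -> Result<Vec<u32>, ModelError> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new_tokens {
            let logits = match self.forward(&ModelInputs::Text(tokens.clone()))? {
                ModelOutputs::Logits(l) => l,
                _ => return Err(ModelError::Unsupported("generate needs logits")),
            };
            // Pick the index of the largest logit.
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32)
                .ok_or(ModelError::Unsupported("empty logits"))?;
            tokens.push(next);
        }
        Ok(tokens)
    }
}

/// Toy text-only model: always predicts (last token + 1) mod vocab.
pub struct SuccessorModel {
    pub vocab: usize,
}

impl Model for SuccessorModel {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, ModelError> {
        match inputs {
            ModelInputs::Text(tokens) if self.vocab > 0 => {
                let last = tokens.last().copied().unwrap_or(0) as usize;
                let mut logits = vec![0.0f32; self.vocab];
                logits[(last + 1) % self.vocab] = 1.0;
                Ok(ModelOutputs::Logits(logits))
            }
            _ => Err(ModelError::Unsupported("text only")),
        }
    }
}
```

A caller that hands this model audio gets a typed error instead of a shape mismatch deep inside the forward pass, and the match on ModelOutputs makes the expected output kind explicit at the call site.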
WeightLoaderCore
A WeightLoader with from_safetensors, from_gguf,
from_pytorch, and auto_detect. Outputs a unified
ModelWeights container so models don't need to know about the format. GGUF
weights are currently dequantized to f32 at load time; direct quantized inference is on the
roadmap.
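As an illustration of what dequantizing to f32 at load time involves, the sketch below expands GGUF Q8_0 blocks. It assumes the standard ggml layout for Q8_0: each block holds a little-endian f16 scale followed by 32 signed 8-bit values, and each weight is scale times value. The function names are hypothetical, and this is not taken from UniLLM's loader.

```rust
/// Weights per Q8_0 block.
const QK8_0: usize = 32;
/// Bytes per block: a 2-byte f16 scale plus 32 one-byte quants.
const BLOCK_BYTES: usize = 2 + QK8_0;

/// Converts IEEE 754 half-precision bits to f32 without external crates.
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as i32;
    let frac = (bits & 0x03ff) as f32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac * 2f32.powi(-24),
        // Infinity or NaN.
        0x1f => {
            if frac == 0.0 {
                sign * f32::INFINITY
            } else {
                f32::NAN
            }
        }
        // Normal numbers: (1 + frac / 2^10) * 2^(exp - 15).
        _ => sign * (1.0 + frac / 1024.0) * 2f32.powi(exp - 15),
    }
}

/// Expands raw Q8_0 bytes into f32 weights.
/// Returns None if the input is not a whole number of blocks.
pub fn dequantize_q8_0(raw: &[u8]) -> Option<Vec<f32>> {
    if raw.len() % BLOCK_BYTES != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(raw.len() / BLOCK_BYTES * QK8_0);
    for block in raw.chunks_exact(BLOCK_BYTES) {
        let scale = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        // Each quant byte is reinterpreted as a signed 8-bit integer.
        out.extend(block[2..].iter().map(|&q| scale * (q as i8) as f32));
    }
    Some(out)
}
```

Dequantizing up front keeps the forward pass format-agnostic at the cost of memory: every 34-byte block becomes 128 bytes of f32, which is the trade-off that direct quantized inference would remove.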
What is and isn't there today
UniLLM is honest about its current state. The summary below is condensed from the project's own ROADMAP as of v0.1.0:
Works today
- End-to-end LLaMA inference: download from Ollama, load GGUF, tokenize, forward, generate.
- Weight loading for GGUF (with Q4_0 / Q8_0 dequantization), SafeTensors, and PyTorch.
- Tokenization via GGUF tokenizers (with byte-level fallback) and HuggingFace tokenizers.
- Greedy, temperature, and top-p (nucleus) sampling.
- 47 model architectures implementing the Model trait via model_config!.
- Hybrid RadixAttention + PagedAttention KV cache with adaptive tiering.
- SIMD kernels (AVX2, AVX-512, NEON) for quantized matmul, RMSNorm, RoPE, SwiGLU.
- Continuous batching, chunked prefill, admission control in the scheduler.
- 201 passing tests across the workspace.
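For readers unfamiliar with the sampling strategies listed above, here is a minimal sketch of greedy, temperature, and top-p (nucleus) sampling over a logits slice. It is a generic illustration, not UniLLM's sampler: the function names are hypothetical, and the random draw is passed in as a uniform value u in [0, 1) so the example needs no external crate.

```rust
/// Temperature-scaled softmax. `temperature` must be positive.
pub fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Greedy decoding: the index of the largest logit.
pub fn greedy(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Top-p sampling: keep the smallest set of most-probable tokens whose
/// cumulative probability reaches `top_p`, then draw from that set.
/// `u` is a caller-supplied uniform random number in [0, 1).
pub fn sample_top_p(logits: &[f32], temperature: f32, top_p: f32, u: f32) -> Option<usize> {
    if logits.is_empty() || temperature <= 0.0 {
        return None;
    }
    let probs = softmax(logits, temperature);

    // Token indices sorted by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Grow the nucleus until its mass reaches top_p.
    let mut kept = 0;
    let mut mass = 0.0f32;
    for &i in &order {
        mass += probs[i];
        kept += 1;
        if mass >= top_p {
            break;
        }
    }

    // Scale u by the nucleus mass, which renormalises the kept tokens.
    let mut threshold = u * mass;
    for &i in &order[..kept] {
        if threshold < probs[i] {
            return Some(i);
        }
        threshold -= probs[i];
    }
    // Guard against rounding: fall back to the last kept token.
    Some(order[kept - 1])
}
```

Lower temperatures sharpen the distribution toward the greedy choice, while a smaller top_p shrinks the candidate set; with top_p at or above 1.0 the function reduces to plain temperature sampling.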
Doesn't work yet
- GPU acceleration: tensor abstraction is wired for CUDA / Metal via Candle feature flags, but no GPU-specific optimisation has been done. Today, inference runs on CPU.
- Real-weight validation beyond LLaMA: the other 46 architectures pass unit tests with dummy tensors but most have not been run against real GGUF files yet.
- HTTP server: there is none yet. Inference is available through the CLI or programmatically.
- Token streaming to clients: the CLI prints tokens as they're produced, but SSE is not yet wired up.
- Repetition penalty, beam search, and min-p sampling: none are implemented.
- KV cache integration: the cache is implemented and tested but not yet integrated into the autoregressive loop.
Audience
UniLLM is for Rust engineers building LLM infrastructure — the people who want to put an inference runtime under their own service rather than wrap somebody else's Python server. It's also for performance engineers who want to read and tune kernels in a typed, ahead-of-time-compiled language.
License
Apache-2.0. See LICENSE.