About UniLLM

UniLLM is a modular LLM inference runtime written in Rust. Its goal is to reduce running a large language model, on whatever device, in whatever weight format, with whatever architecture, to writing one well-typed forward pass rather than undertaking an engineering project.

What the runtime is

UniLLM is organised as a small Cargo workspace. The crates split along clear lines so each can change without dragging the others along:

crates/runtime — core inference runtime: tensor ops, the Model trait, weight loading, and the 47 architecture implementations.
crates/inference — the high-level inference engine and batching.
crates/kv — the hybrid KV cache (RadixAttention + PagedAttention).
crates/scheduler — request scheduling with continuous batching and chunked prefill.
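
To make the scheduler's role concrete, here is a minimal sketch of how continuous batching and chunked prefill can share one per-step token budget. It is an illustration under stated assumptions, not the code in crates/scheduler: the Request and Work types, the plan_step function, and the decode-first policy are all hypothetical names and choices made for this example.

```rust
/// Hypothetical per-request state for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u32,
    /// Total number of prompt tokens.
    pub prompt_len: usize,
    /// Prompt tokens already run through the model.
    pub prefilled: usize,
}

/// One unit of work placed in a batch for a single forward step.
#[derive(Debug, PartialEq)]
pub enum Work {
    /// Process `tokens` more prompt tokens for request `id`.
    Prefill { id: u32, tokens: usize },
    /// Produce one new token for request `id`.
    Decode { id: u32 },
}

/// Plan one forward step under a token budget (assumed policy: decode first).
pub fn plan_step(requests: &mut [Request], token_budget: usize) -> Vec<Work> {
    let mut budget = token_budget;
    let mut plan = Vec::new();

    // Requests whose prompt is fully prefilled cost one token each, so
    // running requests keep making progress on every step.
    for r in requests.iter().filter(|r| r.prefilled == r.prompt_len) {
        if budget == 0 {
            break;
        }
        plan.push(Work::Decode { id: r.id });
        budget -= 1;
    }

    // Whatever budget is left goes to prompt chunks, so a long prompt is
    // spread over several steps instead of stalling the decoders.
    for r in requests.iter_mut().filter(|r| r.prefilled < r.prompt_len) {
        if budget == 0 {
            break;
        }
        let chunk = (r.prompt_len - r.prefilled).min(budget);
        r.prefilled += chunk;
        budget -= chunk;
        plan.push(Work::Prefill { id: r.id, tokens: chunk });
    }
    plan
}
```

Calling plan_step once per forward pass, with new requests pushed into the slice between calls, gives the basic shape of continuous batching. The sketch leaves out admission control and never retires finished requests.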

The three core abstractions

Almost everything in UniLLM is built out of three core abstractions and the types they carry. Each is deliberately small.

TensorCore

A single Tensor struct backed by Arc<dyn TensorStorage>, with a Device enum covering CPU, CUDA(usize), and Metal(usize). Operations go through a TensorOps trait and a functional ops_fn module so model code reads as plain Rust functions, not method chains on opaque types.
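
The shape of this design can be sketched in a few lines. Tensor, TensorStorage, and Device are the names used above; everything else here (the as_f32_slice method, the CpuStorage backend, from_vec, and the add and silu functions standing in for ops_fn) is assumed for illustration and may not match the real crate.

```rust
use std::sync::Arc;

/// Devices named in the text; the usize is assumed to be a device ordinal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    CPU,
    CUDA(usize),
    Metal(usize),
}

/// Hypothetical storage interface: the real trait's methods are not shown
/// in this document.
pub trait TensorStorage: Send + Sync {
    fn device(&self) -> Device;
    fn as_f32_slice(&self) -> &[f32];
}

/// Hypothetical CPU backend holding a flat f32 buffer.
struct CpuStorage(Vec<f32>);

impl TensorStorage for CpuStorage {
    fn device(&self) -> Device {
        Device::CPU
    }
    fn as_f32_slice(&self) -> &[f32] {
        &self.0
    }
}

/// Cloning a Tensor only bumps the Arc reference count; the buffer is shared.
#[derive(Clone)]
pub struct Tensor {
    storage: Arc<dyn TensorStorage>,
    shape: Vec<usize>,
}

impl Tensor {
    pub fn from_vec(data: Vec<f32>, shape: Vec<usize>) -> Tensor {
        assert_eq!(data.len(), shape.iter().product::<usize>(), "shape must match data length");
        Tensor { storage: Arc::new(CpuStorage(data)), shape }
    }
    pub fn device(&self) -> Device {
        self.storage.device()
    }
    pub fn shape(&self) -> &[usize] {
        &self.shape
    }
    pub fn to_vec(&self) -> Vec<f32> {
        self.storage.as_f32_slice().to_vec()
    }
}

/// Functional-style elementwise add (no broadcasting in this sketch).
pub fn add(a: &Tensor, b: &Tensor) -> Tensor {
    assert_eq!(a.shape(), b.shape(), "add expects equal shapes");
    let data: Vec<f32> = a
        .storage
        .as_f32_slice()
        .iter()
        .zip(b.storage.as_f32_slice())
        .map(|(x, y)| x + y)
        .collect();
    Tensor::from_vec(data, a.shape.clone())
}

/// Functional-style SiLU activation: x * sigmoid(x).
pub fn silu(x: &Tensor) -> Tensor {
    let data: Vec<f32> = x
        .storage
        .as_f32_slice()
        .iter()
        .map(|v| v / (1.0 + (-v).exp()))
        .collect();
    Tensor::from_vec(data, x.shape.clone())
}
```

With free functions like these, model code can be written as silu(&add(&a, &b)), which is the "plain Rust functions" style the paragraph describes.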

ModelCore

One Model trait with new, from_weights, forward, generate, and to_device. Inputs and outputs are typed enums — ModelInputs::Text, ::Image, ::Multimodal, ::Audio and ModelOutputs::Logits, ::Embeddings, ::Multimodal — so the type system tells you which adapter you need for which use case.
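
A reduced sketch of this idea follows. The variant names come from the text above, but their payloads are assumptions, the trait is cut down to forward and generate, and ModelError and CountingModel are hypothetical types invented for the example.

```rust
/// Payloads are assumptions: the text names the variants, not their fields.
#[derive(Debug, Clone, PartialEq)]
pub enum ModelInputs {
    Text(Vec<u32>),  // token ids
    Image(Vec<f32>), // flattened pixels
    Audio(Vec<f32>), // samples
    Multimodal { text: Vec<u32>, image: Vec<f32> },
}

#[derive(Debug, Clone, PartialEq)]
pub enum ModelOutputs {
    Logits(Vec<f32>),
    Embeddings(Vec<f32>),
    Multimodal { logits: Vec<f32>, embeddings: Vec<f32> },
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ModelError {
    Unsupported(&'static str),
}

/// Cut-down version of the trait: only forward and generate.
pub trait Model {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, ModelError>;

    /// Greedy generation written once in terms of forward.
    fn generate(&self, prompt: &[u32], max_new_tokens: usize) -> Result<Vec<u32>, ModelError> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new_tokens {
            let logits = match self.forward(&ModelInputs::Text(tokens.clone()))? {
                ModelOutputs::Logits(l) => l,
                _ => return Err(ModelError::Unsupported("generate needs logits")),
            };
            // Pick the index of the largest logit as the next token.
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32)
                .unwrap_or(0);
            tokens.push(next);
        }
        Ok(tokens)
    }
}

/// Toy text-only model: always predicts (last_token + 1) % vocab.
/// `vocab` must be non-zero.
pub struct CountingModel {
    pub vocab: u32,
}

impl Model for CountingModel {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, ModelError> {
        match inputs {
            ModelInputs::Text(tokens) => {
                let last = tokens.last().copied().unwrap_or(0);
                let mut logits = vec![0.0f32; self.vocab as usize];
                logits[((last + 1) % self.vocab) as usize] = 1.0;
                Ok(ModelOutputs::Logits(logits))
            }
            _ => Err(ModelError::Unsupported("CountingModel is text-only")),
        }
    }
}
```

Because inputs and outputs are enums, a text-only model has to say explicitly what it does with an image or audio input, and callers have to match on the kind of output they receive.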

WeightLoaderCore

A WeightLoader with from_safetensors, from_gguf, from_pytorch, and auto_detect. It produces a unified ModelWeights container, so models don't need to know which format the weights came from. GGUF weights are currently dequantized to f32 at load time; direct quantized inference is on the roadmap.
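
How auto_detect decides is not described here. One plausible approach is to sniff the first bytes of the file, since all three formats have recognisable headers. The sketch below assumes that approach; WeightFormat and detect_format are hypothetical names, and the real loader may rely on file extensions or other checks instead.

```rust
/// Hypothetical format tag for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WeightFormat {
    Gguf,
    SafeTensors,
    PyTorch,
}

/// Guess the weight format from the first bytes of a file.
/// Returns None when nothing matches.
pub fn detect_format(header: &[u8]) -> Option<WeightFormat> {
    // GGUF files begin with the ASCII magic "GGUF".
    if header.starts_with(b"GGUF") {
        return Some(WeightFormat::Gguf);
    }
    // Checkpoints written by torch.save in its zip-based format begin with
    // the zip local-file-header magic. Older pickle-only files will not match.
    if header.starts_with(b"PK\x03\x04") {
        return Some(WeightFormat::PyTorch);
    }
    // SafeTensors: an 8-byte little-endian header length, followed by a
    // JSON header that starts with '{'.
    if header.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&header[..8]);
        let json_len = u64::from_le_bytes(len_bytes);
        if json_len > 0 && header[8] == b'{' {
            return Some(WeightFormat::SafeTensors);
        }
    }
    None
}
```

Reading the first nine bytes of a file is enough to feed this function. A production loader would likely validate more of the header before committing to a parser.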

What is and isn't there today

The project's own ROADMAP is candid about the current state. Here is a condensed view as of v0.1.0:

Works today

End-to-end LLaMA inference: download from Ollama, load GGUF, tokenize, forward, generate.
Weight loading for GGUF (with Q4_0 / Q8_0 dequantization), SafeTensors, and PyTorch.
Tokenization via GGUF tokenizers (with byte-level fallback) and HuggingFace tokenizers.
Greedy, temperature, and top-p (nucleus) sampling.
47 model architectures implementing the Model trait via model_config!.
Hybrid RadixAttention + PagedAttention KV cache with adaptive tiering.
SIMD kernels (AVX2, AVX-512, NEON) for quantized matmul, RMSNorm, RoPE, SwiGLU.
Continuous batching, chunked prefill, admission control in the scheduler.
201 passing tests across the workspace.
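
For readers unfamiliar with the three sampling modes listed above, here is a minimal reference sketch of greedy, temperature, and top-p sampling over a logits slice. It is not UniLLM's implementation: the function names are hypothetical, and the random draw is passed in as u (a value in [0, 1)) so the example needs no RNG crate.

```rust
/// Greedy decoding: index of the largest logit, or None for empty input.
pub fn greedy(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Softmax over logits divided by `temperature` (must be > 0).
/// Lower temperatures sharpen the distribution, higher ones flatten it.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max logit before exponentiating for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Top-p (nucleus) sampling: keep the smallest set of highest-probability
/// tokens whose cumulative mass reaches `top_p`, then sample inside that
/// set using `u`, a uniform draw in [0, 1) supplied by the caller.
pub fn sample_top_p(probs: &[f32], top_p: f32, u: f32) -> Option<usize> {
    // Token indices sorted by descending probability.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));

    // Find how many of the top tokens form the nucleus.
    let mut cumulative = 0.0f32;
    let mut cutoff = 0;
    for (rank, &i) in order.iter().enumerate() {
        cumulative += probs[i];
        cutoff = rank + 1;
        if cumulative >= top_p {
            break;
        }
    }
    let nucleus = &order[..cutoff];

    // Scale the draw to the nucleus mass, then walk the cumulative sum.
    let mass: f32 = nucleus.iter().map(|&i| probs[i]).sum();
    let target = u * mass;
    let mut acc = 0.0f32;
    for &i in nucleus {
        acc += probs[i];
        if target < acc {
            return Some(i);
        }
    }
    // Fall back to the last nucleus entry if rounding left us short.
    nucleus.last().copied()
}
```

A typical pipeline would call softmax_with_temperature on the final logits and pass the result to sample_top_p. Setting top_p to 1.0 keeps the whole vocabulary, which reduces to plain temperature sampling.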

Doesn't work yet

GPU acceleration: the tensor abstraction is wired for CUDA / Metal via Candle feature flags, but no GPU-specific optimisation has been done. Today, inference runs on CPU.
Real-weight validation beyond LLaMA: the other 46 architectures pass unit tests with dummy tensors but most have not been run against real GGUF files yet.
No HTTP server. Inference is CLI or programmatic.
No token streaming to clients (the CLI prints tokens as they're produced; SSE is not yet wired up).
No repetition penalty, beam search, or min-p sampling.
KV cache is implemented and tested but not yet integrated into the autoregressive loop.

Audience

UniLLM is for Rust engineers building LLM infrastructure: people who want to put an inference runtime under their own service rather than wrap somebody else's Python server. It's also for performance engineers who want to read and tune kernels in a typed, ahead-of-time-compiled language.

License

Apache-2.0. See LICENSE.