An LLM inference runtime, built in Rust.
UniLLM provides a unified, type-safe interface for running large language models across
47 architectures. Three composable abstractions — TensorCore,
ModelCore, and WeightLoaderCore — let you load weights in any
format, run inference on any device, and add new model architectures with minimal
boilerplate.
Three composable layers
The runtime is split along three axes — the tensor, the model, and the weights — so each can evolve independently without leaking through the others.
TensorCore
A single Tensor type with device-agnostic storage and a unified
TensorOps trait. Dispatch is CPU-only today; CUDA and Metal variants are meant to
land behind the same interface.
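To make the idea concrete, here is a minimal sketch of a backend trait over one tensor type. It is an illustration only: `Device`, `CpuBackend`, the field layout, and the method signatures are assumptions made for this example and are not taken from UniLLM's source.

```rust
// Hypothetical sketch: names mirror the prose above, not UniLLM's real API.

/// Where a tensor's data lives. Only `Cpu` exists in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Device {
    Cpu,
}

/// A dense, row-major f32 tensor.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub data: Vec<f32>,
    pub shape: Vec<usize>,
    pub device: Device,
}

impl Tensor {
    /// Builds a CPU tensor, checking that `data` matches `shape`.
    pub fn new(data: Vec<f32>, shape: Vec<usize>) -> Result<Self, String> {
        let expected: usize = shape.iter().product();
        if data.len() != expected {
            return Err(format!(
                "shape {:?} needs {} elements, got {}",
                shape,
                expected,
                data.len()
            ));
        }
        Ok(Self { data, shape, device: Device::Cpu })
    }
}

/// Operations every backend provides. A GPU backend would implement
/// the same trait, so model code never names a device directly.
pub trait TensorOps {
    fn add(&self, a: &Tensor, b: &Tensor) -> Result<Tensor, String>;
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Result<Tensor, String>;
}

pub struct CpuBackend;

impl TensorOps for CpuBackend {
    fn add(&self, a: &Tensor, b: &Tensor) -> Result<Tensor, String> {
        if a.shape != b.shape {
            return Err("add: shape mismatch".to_string());
        }
        let data = a.data.iter().zip(&b.data).map(|(x, y)| x + y).collect();
        Tensor::new(data, a.shape.clone())
    }

    /// Naive 2-D matmul: [m, k] x [k, n] -> [m, n].
    fn matmul(&self, a: &Tensor, b: &Tensor) -> Result<Tensor, String> {
        if a.shape.len() != 2 || b.shape.len() != 2 || a.shape[1] != b.shape[0] {
            return Err("matmul: expected [m, k] x [k, n]".to_string());
        }
        let (m, k, n) = (a.shape[0], a.shape[1], b.shape[1]);
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for p in 0..k {
                let a_ip = a.data[i * k + p];
                for j in 0..n {
                    out[i * n + j] += a_ip * b.data[p * n + j];
                }
            }
        }
        Tensor::new(out, vec![m, n])
    }
}
```

Model code written against `TensorOps` would call `backend.matmul(&a, &b)` without knowing which device runs it, which is the property the description above relies on.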
ModelCore
One Model trait with forward() and generate().
The model_config! macro generates the boilerplate so a new
architecture is mostly its forward pass.
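The sketch below shows why a shared trait keeps a new architecture small: if `generate()` is written once against `forward()`, each model only supplies its forward pass. The signatures, the greedy decoding loop, and the toy `CountingModel` are assumptions for illustration; the actual trait and the output of `model_config!` may look different.

```rust
// Hypothetical sketch: not UniLLM's actual trait definition.

pub trait Model {
    /// Returns next-token logits given all tokens so far.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;

    /// Greedy decoding, written once and shared by every architecture.
    /// Stops after `max_new_tokens` or when `eos` is produced.
    fn generate(&self, prompt: &[u32], max_new_tokens: usize, eos: u32) -> Vec<u32> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new_tokens {
            let logits = self.forward(&tokens);
            let next = argmax(&logits);
            tokens.push(next);
            if next == eos {
                break;
            }
        }
        tokens
    }
}

/// Index of the largest logit (0 for an empty slice).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Toy "architecture": always predicts (last token + 1) mod vocab.
pub struct CountingModel {
    pub vocab: usize,
}

impl Model for CountingModel {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        let last = tokens.last().copied().unwrap_or(0) as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}
```

With this shape, `CountingModel { vocab: 5 }.generate(&[1], 10, 4)` only exercises the one method the toy model had to write.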
WeightLoaderCore
Format-agnostic loader for SafeTensors, GGUF (with Q4_0/Q8_0 dequantization), and
PyTorch files, returning a unified ModelWeights container.
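As a rough illustration of what Q4_0/Q8_0 dequantization involves, the sketch below expands GGUF-style blocks of 32 weights back to f32. The struct and function names are hypothetical, and the per-block scale is assumed to be already converted from f16 to f32; a real loader reads these fields from the file's raw bytes.

```rust
// Hypothetical sketch of block dequantization; not UniLLM's loader code.

/// Both formats group weights into blocks of 32.
pub const BLOCK: usize = 32;

/// Q8_0: one scale plus 32 signed 8-bit quants. weight = scale * q
pub struct BlockQ8_0 {
    pub scale: f32, // assumption: already decoded from the on-disk f16
    pub quants: [i8; BLOCK],
}

/// Q4_0: one scale plus 32 unsigned 4-bit quants packed two per byte.
/// weight = scale * (q - 8)
pub struct BlockQ4_0 {
    pub scale: f32, // assumption: already decoded from the on-disk f16
    pub packed: [u8; BLOCK / 2],
}

pub fn dequantize_q8_0(blocks: &[BlockQ8_0]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK);
    for b in blocks {
        out.extend(b.quants.iter().map(|&q| b.scale * q as f32));
    }
    out
}

pub fn dequantize_q4_0(blocks: &[BlockQ4_0]) -> Vec<f32> {
    let mut out = vec![0.0f32; blocks.len() * BLOCK];
    for (i, b) in blocks.iter().enumerate() {
        let base = i * BLOCK;
        for (j, &byte) in b.packed.iter().enumerate() {
            // ggml layout: low nibbles fill the first half of the block,
            // high nibbles fill the second half.
            let lo = (byte & 0x0F) as i32 - 8;
            let hi = (byte >> 4) as i32 - 8;
            out[base + j] = b.scale * lo as f32;
            out[base + j + BLOCK / 2] = b.scale * hi as f32;
        }
    }
    out
}
```

Once every format is expanded to plain f32 (or another common dtype), the model sees the same container regardless of where the weights came from.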
Quick start
Clone, type-check, run the tests, and generate text. TinyLlama (~600 MB) downloads on first run.
git clone https://github.com/cognisoc/unillm.git
cd unillm
cargo check
cargo test --workspace
# Generate text (downloads TinyLlama on first run, ~600 MB)
cargo run --bin unillm -p unillm-runtime -- \
generate --prompt "Explain gravity"
Inference today runs on CPU. The tensor abstraction is wired for CUDA and Metal via Candle feature flags, but GPU-specific optimization is still on the roadmap. See the ROADMAP for an honest snapshot of what works and what doesn't.
47 model architectures, one trait
Every architecture — from LLaMA to RWKV-6, from Whisper to Mamba — implements
the same Model trait. LLaMA is validated end-to-end with real GGUF weights;
the other 46 architectures have correct forward-pass implementations covered by unit tests
with dummy tensors and are being validated as real weights come online.
Core LLMs
LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral.
GPT family
GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT.
MoE
DeepSeek-MoE, DBRX, Grok, Arctic, Jamba.
Linear attention
RWKV-4, RWKV-6, RecurrentGemma, Mamba.
Vision-Language
Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, LLaVA, CLIP.
Audio / Speech
Whisper, Wav2Vec2, HuBERT, MusicGen, Encodec.
Performance levers, by design
SIMD kernels
AVX2, AVX-512, and NEON implementations for quantized matmul, RMSNorm, RoPE, and SwiGLU.
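For reference, two of the operations named above can be written in scalar Rust as follows. This is a plain sketch of the standard RMSNorm and SwiGLU formulas, useful as a baseline when checking a vectorized kernel; the function names are illustrative and none of this is UniLLM's SIMD code.

```rust
// Scalar reference versions; a SIMD kernel should match these up to
// floating-point rounding. Function names are hypothetical.

/// RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "x and weight must have equal length");
    if x.is_empty() {
        return Vec::new();
    }
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU (swish) activation: x * sigmoid(x)
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU gating: out_i = silu(gate_i) * up_i
pub fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have equal length");
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// Largest element-wise difference, for comparing a fast kernel
/// against the scalar reference.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

A typical check runs both implementations on the same input and asserts that `max_abs_diff` stays below a small tolerance.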
Hybrid KV cache
RadixAttention + PagedAttention with an adaptive tiering policy. Wiring into the generation loop is next.
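The paged half of that design can be sketched in a few lines: the cache is cut into fixed-size blocks, and each sequence keeps a block table that maps token positions to physical blocks. The types below are hypothetical and cover only block bookkeeping; prefix sharing (the RadixAttention side) and the tiering policy are not modeled.

```rust
// Hypothetical sketch of paged KV-cache bookkeeping; not UniLLM's code.

/// Hands out physical cache blocks from a fixed pool.
pub struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    pub fn new(num_blocks: usize) -> Self {
        // Reverse so that block 0 is handed out first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    pub fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    pub fn release(&mut self, block: usize) {
        self.free.push(block);
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// Per-sequence view: logical token positions -> physical blocks.
pub struct SequenceCache {
    block_size: usize,
    len: usize,
    block_table: Vec<usize>,
}

impl SequenceCache {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { block_size, len: 0, block_table: Vec::new() }
    }

    /// Reserves a slot for one more token and returns
    /// (physical block, offset), or None if the pool is exhausted.
    pub fn append(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        if self.len == self.block_table.len() * self.block_size {
            let block = alloc.alloc()?;
            self.block_table.push(block);
        }
        let pos = self.len;
        self.len += 1;
        self.locate(pos)
    }

    /// Where the K/V entries for token `pos` are stored.
    pub fn locate(&self, pos: usize) -> Option<(usize, usize)> {
        if pos >= self.len {
            return None;
        }
        Some((self.block_table[pos / self.block_size], pos % self.block_size))
    }

    /// Returns all of this sequence's blocks to the pool.
    pub fn release(self, alloc: &mut BlockAllocator) {
        for block in self.block_table {
            alloc.release(block);
        }
    }
}
```

Because a sequence only claims a new block when its last one is full, memory grows in block-sized steps rather than being reserved up front for the maximum context length.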
Scheduler
Request scheduling with continuous batching, chunked prefill, and admission control.
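The three scheduling ideas fit together roughly as in the toy simulation below. It tracks only token counts, not real inference, and every name in it (`Request`, `Scheduler`, `submit`, `step`) is an assumption for illustration rather than UniLLM's scheduler API.

```rust
use std::collections::VecDeque;

// Hypothetical toy scheduler; it counts tokens instead of running a model.

#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u32,
    pub prompt_left: usize, // prompt tokens still to prefill
    pub decode_left: usize, // output tokens still to generate
}

pub struct Scheduler {
    waiting: VecDeque<Request>,
    running: Vec<Request>,
    max_batch: usize,
    max_waiting: usize,
    prefill_chunk: usize,
}

impl Scheduler {
    pub fn new(max_batch: usize, max_waiting: usize, prefill_chunk: usize) -> Self {
        Self {
            waiting: VecDeque::new(),
            running: Vec::new(),
            max_batch,
            max_waiting,
            prefill_chunk,
        }
    }

    /// Admission control: refuse new work when the queue is full.
    pub fn submit(&mut self, req: Request) -> bool {
        if self.waiting.len() >= self.max_waiting {
            return false;
        }
        self.waiting.push_back(req);
        true
    }

    /// One engine iteration. Returns ids of requests that finished.
    pub fn step(&mut self) -> Vec<u32> {
        // Continuous batching: refill free slots before every step
        // instead of waiting for the whole batch to drain.
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(r) => self.running.push(r),
                None => break,
            }
        }
        let chunk = self.prefill_chunk;
        let mut finished = Vec::new();
        for r in &mut self.running {
            if r.prompt_left > 0 {
                // Chunked prefill: at most `chunk` prompt tokens per step.
                r.prompt_left -= r.prompt_left.min(chunk);
            } else if r.decode_left > 0 {
                r.decode_left -= 1;
            }
            if r.prompt_left == 0 && r.decode_left == 0 {
                finished.push(r.id);
            }
        }
        self.running.retain(|r| r.prompt_left > 0 || r.decode_left > 0);
        finished
    }

    pub fn running_ids(&self) -> Vec<u32> {
        self.running.iter().map(|r| r.id).collect()
    }
}
```

The key behavior is that a slot freed by a finished request is reused on the very next step, so short requests do not wait behind long ones in the same batch.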
From the blog
- A walk through TensorCore, ModelCore, and WeightLoaderCore, the three traits that let UniLLM support 47 model families without forking a runtime per architecture.
- Why UniLLM's KV cache is hybrid, what RadixAttention and PagedAttention each contribute, and the honest state of integration today.
- How UniLLM's WeightLoaderCore makes SafeTensors, GGUF, and PyTorch checkpoints interchangeable from a model's point of view, and what dequantization looks like today.