Rust · LLM runtime

An LLM inference runtime, built in Rust.

UniLLM provides a unified, type-safe interface for running large language models across 47 architectures. Three composable abstractions — TensorCore, ModelCore, and WeightLoaderCore — let you load weights in any format, run inference on any device, and add new model architectures with minimal boilerplate.

Three composable layers

The runtime is split along three axes — the tensor, the model, and the weights — so each can evolve independently without leaking through the others.

TensorCore

A single Tensor type with device-agnostic storage and a unified TensorOps trait. CPU dispatch today; CUDA and Metal variants land via the same interface.
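As a rough illustration of the idea, the sketch below pairs one tensor type with one operations trait. Everything in it is an assumption made for illustration: the Device enum, the field names, the String error type, and the two methods on TensorOps are hypothetical and are not taken from the UniLLM source. Only a CPU path is shown.

```rust
/// Hypothetical device tag; only the CPU variant is backed by storage in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Device {
    Cpu,
}

/// A minimal row-major f32 tensor. Field names are illustrative, not UniLLM's.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub data: Vec<f32>,
    pub shape: Vec<usize>,
    pub device: Device,
}

impl Tensor {
    /// Builds a CPU tensor, rejecting data whose length does not match the shape.
    pub fn new(data: Vec<f32>, shape: Vec<usize>) -> Result<Self, String> {
        let expected: usize = shape.iter().product();
        if expected != data.len() {
            return Err(format!(
                "shape {:?} needs {} elements, got {}",
                shape,
                expected,
                data.len()
            ));
        }
        Ok(Self { data, shape, device: Device::Cpu })
    }
}

/// Operations a backend would implement; callers only see this trait.
pub trait TensorOps: Sized {
    fn add(&self, other: &Self) -> Result<Self, String>;
    fn matmul(&self, other: &Self) -> Result<Self, String>;
}

impl TensorOps for Tensor {
    /// Element-wise addition of two tensors with identical shapes.
    fn add(&self, other: &Self) -> Result<Self, String> {
        if self.shape != other.shape {
            return Err("add: shape mismatch".to_string());
        }
        let data = self.data.iter().zip(&other.data).map(|(a, b)| a + b).collect();
        Tensor::new(data, self.shape.clone())
    }

    /// Naive 2-D matrix multiply: [m, k] x [k, n] -> [m, n].
    fn matmul(&self, other: &Self) -> Result<Self, String> {
        if self.shape.len() != 2 || other.shape.len() != 2 || self.shape[1] != other.shape[0] {
            return Err("matmul: expected [m, k] x [k, n]".to_string());
        }
        let (m, k, n) = (self.shape[0], self.shape[1], other.shape[1]);
        let mut out = vec![0.0f32; m * n];
        for i in 0..m {
            for p in 0..k {
                let a = self.data[i * k + p];
                for j in 0..n {
                    out[i * n + j] += a * other.data[p * n + j];
                }
            }
        }
        Tensor::new(out, vec![m, n])
    }
}
```

A GPU backend would, under this design, provide its own TensorOps implementation while model code keeps calling the same trait methods.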

ModelCore

One Model trait with forward() and generate(). The model_config! macro generates the boilerplate so a new architecture is mostly its forward pass.
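To make the shape of such a trait concrete, here is a minimal sketch. The signatures are assumptions: the real Model trait almost certainly works on tensors and a richer config, and the model_config! macro is not reproduced here. CountingModel is a toy stand-in invented for this example, and generate() uses plain greedy decoding.

```rust
/// Hypothetical, simplified version of a model trait: token ids in, logits out.
pub trait Model {
    /// Returns next-token logits for the given context.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;

    /// Greedy decoding built on forward(): append the arg-max token each step.
    fn generate(&self, prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new_tokens {
            let logits = self.forward(&tokens);
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32);
            match next {
                Some(t) => tokens.push(t),
                None => break, // empty vocabulary: nothing to emit
            }
        }
        tokens
    }
}

/// Toy architecture: always predicts (last token + 1) modulo the vocabulary size.
pub struct CountingModel {
    pub vocab: usize,
}

impl Model for CountingModel {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = vec![0.0; self.vocab];
        if self.vocab == 0 {
            return logits;
        }
        let last = tokens.last().copied().unwrap_or(0) as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}
```

Because generate() is a default method, a new architecture in this sketch only has to supply forward(), which mirrors the claim that a new architecture is mostly its forward pass.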

WeightLoaderCore

Format-agnostic loader for SafeTensors, GGUF (with Q4_0/Q8_0 dequantization), and PyTorch files, returning a unified ModelWeights container.
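For readers unfamiliar with the two quantization formats named above, the sketch below shows what block dequantization looks like. It follows the ggml block layout as commonly documented: 32 weights per block, Q8_0 storing signed bytes, Q4_0 packing two 4-bit values per byte with an offset of 8. Two things are assumptions: the type and function names are invented here and are not UniLLM's, and the scale is held as f32 to keep the example dependency-free, whereas the on-disk format stores it as a 16-bit float.

```rust
/// Number of weights per quantized block.
pub const BLOCK: usize = 32;

/// Q8_0-style block: one scale and 32 signed 8-bit quants.
pub struct Q8Block {
    pub d: f32,
    pub qs: [i8; BLOCK],
}

/// Q4_0-style block: one scale and 32 4-bit quants packed two per byte.
pub struct Q4Block {
    pub d: f32,
    pub qs: [u8; BLOCK / 2],
}

/// Each weight is simply quant * scale.
pub fn dequantize_q8_0(b: &Q8Block) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(b.qs.iter()) {
        *o = q as f32 * b.d;
    }
    out
}

/// Low nibbles fill the first half of the block, high nibbles the second half;
/// both are shifted by -8 so the 4-bit range maps to -8..=7 before scaling.
pub fn dequantize_q4_0(b: &Q4Block) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (j, &byte) in b.qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * b.d;
        out[j + BLOCK / 2] = hi as f32 * b.d;
    }
    out
}
```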

Quick start

Clone the repository, type-check it, run the test suite, and run a generation. TinyLlama (~600 MB) downloads on first run.

git clone https://github.com/cognisoc/unillm.git
cd unillm
cargo check
cargo test --workspace

# Generate text (downloads TinyLlama on first run, ~600 MB)
cargo run --bin unillm -p unillm-runtime -- \
  generate --prompt "Explain gravity"

Inference today runs on CPU. The tensor abstraction is wired for CUDA and Metal via Candle feature flags, but GPU-specific optimization is still on the roadmap. See the ROADMAP for an honest snapshot of what works and what doesn't.

47 model architectures, one trait

Every architecture, from LLaMA to RWKV-6 and from Whisper to Mamba, implements the same Model trait. LLaMA is validated end-to-end with real GGUF weights. The other 46 architectures have forward-pass implementations covered by unit tests with dummy tensors, and each is being validated against real weights as those come online.

Core LLMs

LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral.

GPT family

GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT.

MoE

DeepSeek-MoE, DBRX, Grok, Arctic, Jamba.

Linear attention

RWKV-4, RWKV-6, RecurrentGemma, Mamba.

Vision-Language

Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, LLaVA, CLIP.

Audio / Speech

Whisper, Wav2Vec2, HuBERT, MusicGen, Encodec.

Performance levers, by design

SIMD kernels

AVX2, AVX-512, and NEON implementations for quantized matmul, RMSNorm, RoPE, and SwiGLU.
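For reference, two of the operations named above can be written in scalar form as follows. This is only a plain-Rust statement of the math that a SIMD kernel would vectorize, not the project's kernels; the function names and signatures are assumptions.

```rust
/// RMSNorm: scale x by the reciprocal of its root-mean-square, then by a learned weight.
/// y_i = x_i / sqrt(mean(x^2) + eps) * w_i
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "rms_norm: length mismatch");
    if x.is_empty() {
        return Vec::new();
    }
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SwiGLU gating: silu(gate) * up, where silu(g) = g / (1 + exp(-g)).
pub fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "swiglu: length mismatch");
    gate.iter()
        .zip(up)
        .map(|(g, u)| (g / (1.0 + (-g).exp())) * u)
        .collect()
}
```

A scalar version like this is also a convenient oracle when testing vectorized implementations against small inputs.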

Hybrid KV cache

RadixAttention + PagedAttention with an adaptive tiering policy. Wiring into the generation loop is next.
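The paging half of this design can be sketched as a block allocator: the cache is carved into fixed-size blocks, and each sequence keeps a table mapping its logical token positions to physical blocks. The code below is a simplified illustration under that assumption only. It does not model RadixAttention prefix sharing or the tiering policy, and PagedKvAllocator and its methods are hypothetical names.

```rust
use std::collections::HashMap;

/// Hands out fixed-size KV blocks and tracks which blocks each sequence owns.
pub struct PagedKvAllocator {
    block_size: usize,
    free: Vec<usize>,
    /// Sequence id -> (physical block ids in logical order, tokens stored so far).
    tables: HashMap<u64, (Vec<usize>, usize)>,
}

impl PagedKvAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self {
            block_size,
            // Reversed so that pop() hands out block 0 first.
            free: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
        }
    }

    /// Reserves a slot for one more token of `seq`.
    /// Returns (physical block, offset in block), or None if no block is free.
    pub fn append_token(&mut self, seq: u64) -> Option<(usize, usize)> {
        let block_size = self.block_size;
        let needs_block = match self.tables.get(&seq) {
            Some((blocks, len)) => *len == blocks.len() * block_size,
            None => true,
        };
        // Allocate before touching the table so a failed allocation leaves no state behind.
        let new_block = if needs_block { Some(self.free.pop()?) } else { None };
        let entry = self.tables.entry(seq).or_insert_with(|| (Vec::new(), 0));
        if let Some(b) = new_block {
            entry.0.push(b);
        }
        let slot = (entry.0[entry.1 / block_size], entry.1 % block_size);
        entry.1 += 1;
        Some(slot)
    }

    /// Returns all blocks owned by `seq` to the free list.
    pub fn release(&mut self, seq: u64) {
        if let Some((blocks, _)) = self.tables.remove(&seq) {
            self.free.extend(blocks);
        }
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}
```

Because blocks need not be contiguous, a finished sequence frees memory that any other sequence can reuse immediately, which is the main point of paging the cache.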

Scheduler

Request scheduling with continuous batching, chunked prefill, and admission control.
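A minimal sketch of continuous batching with a simple admission limit follows. It assumes a fixed maximum batch size as the only admission rule and leaves out chunked prefill entirely; Request, Scheduler, and their fields are hypothetical and not the project's types.

```rust
use std::collections::VecDeque;

/// A generation request that still needs `remaining` tokens decoded.
pub struct Request {
    pub id: u64,
    pub remaining: usize,
}

/// Keeps the running batch full by admitting waiting requests between decode steps.
pub struct Scheduler {
    max_batch: usize,
    waiting: VecDeque<Request>,
    running: Vec<Request>,
}

impl Scheduler {
    pub fn new(max_batch: usize) -> Self {
        Self { max_batch, waiting: VecDeque::new(), running: Vec::new() }
    }

    pub fn submit(&mut self, req: Request) {
        self.waiting.push_back(req);
    }

    /// One decode step: admit, decode one token per running request, retire finished ones.
    /// Returns the ids that were part of this step's batch.
    pub fn step(&mut self) -> Vec<u64> {
        // Admission control: fill free batch slots in arrival order.
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(r) if r.remaining > 0 => self.running.push(r),
                Some(_) => continue, // nothing to generate; drop it
                None => break,
            }
        }
        let batch: Vec<u64> = self.running.iter().map(|r| r.id).collect();
        for r in &mut self.running {
            r.remaining -= 1; // stands in for decoding one token
        }
        // Finished requests leave right away, freeing slots for the next step.
        self.running.retain(|r| r.remaining > 0);
        batch
    }
}
```

The difference from static batching is visible in step(): a slot freed by a short request is refilled on the very next step instead of waiting for the whole batch to finish.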
