Rust · LLM runtime

An LLM inference runtime, built in Rust.

UniLLM provides a unified, type-safe interface for running large language models across 47 architectures. Three composable abstractions — TensorCore, ModelCore, and WeightLoaderCore — let you load weights in any format, run inference on any device, and add new model architectures with minimal boilerplate.


SafeTensors · GGUF · PyTorch · AVX-512 · NEON · Apache-2.0

main.rs

use unillm_runtime::{Model, WeightLoader, Device};

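// Assumption: UniLLM's error types implement std::error::Error, so a boxed error works here.
fn main() -> Result<(), Box<dyn std::error::Error>> {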
    // Format-agnostic: SafeTensors, GGUF, or PyTorch
    let weights = WeightLoader::auto_detect("tinyllama.gguf")?;
    let model = Model::from_weights(weights, Device::Cpu)?;

    let out = model.generate("Explain gravity", 128)?;
    println!("{out}");
    Ok(())
}

What is UniLLM?

A type-safe inference runtime you put under your own service.

UniLLM is a modular LLM inference runtime written in Rust. Instead of wrapping someone else's Python server, you get one Tensor type, one Model trait, and one format-agnostic weight loader. Adding a new architecture is mostly writing its forward pass; running it is a matter of picking a device. It is Apache-2.0, CPU-only today, with GPU acceleration on the near-term roadmap.

The problem

LLM runtimes make you choose between formats, devices, and models.

Without a unified runtime

Each weight format wants its own loader and its own quirks.
Device support leaks into model code, so CPU and GPU paths diverge.
Every new architecture is an engineering project, not a forward pass.
You end up wrapping a Python server you cannot read or tune.

With UniLLM's three cores

One WeightLoader auto-detects SafeTensors, GGUF, and PyTorch.
A device-agnostic Tensor keeps model code the same across backends.
The model_config! macro reduces a new model to its forward pass.
Typed, ahead-of-time-compiled Rust you own end to end.

Architecture

Three composable layers

The runtime is split along three axes — the tensor, the model, and the weights — so each can evolve independently without leaking through the others.

TensorCore

A single Tensor type with device-agnostic storage and a unified TensorOps trait. CPU dispatch today; CUDA and Metal variants land via the same interface.
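
To make the idea concrete, here is a minimal sketch of a device-tagged tensor behind a single ops trait. It is not UniLLM's actual code: the field layout, the String errors, and the two-method TensorOps shown here are assumptions for illustration, and only a CPU arm exists.

```rust
/// Hypothetical device tag. Only CPU is implemented in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Device {
    Cpu,
}

/// Hypothetical tensor: flat row-major f32 storage, a shape, and a device tag.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    data: Vec<f32>,
    shape: Vec<usize>,
    device: Device,
}

/// Hypothetical ops trait. Model code calls these and never branches on the device.
pub trait TensorOps: Sized {
    fn add(&self, other: &Self) -> Result<Self, String>;
    fn matmul(&self, other: &Self) -> Result<Self, String>;
}

impl Tensor {
    pub fn new(data: Vec<f32>, shape: Vec<usize>, device: Device) -> Result<Self, String> {
        let expected: usize = shape.iter().product();
        if data.len() != expected {
            return Err(format!(
                "shape {:?} needs {} values, got {}",
                shape,
                expected,
                data.len()
            ));
        }
        Ok(Self { data, shape, device })
    }

    pub fn data(&self) -> &[f32] {
        &self.data
    }

    pub fn shape(&self) -> &[usize] {
        &self.shape
    }
}

impl TensorOps for Tensor {
    fn add(&self, other: &Self) -> Result<Self, String> {
        if self.shape != other.shape {
            return Err("add: shape mismatch".to_string());
        }
        // Backend dispatch lives here, in one place, instead of in model code.
        match self.device {
            Device::Cpu => {
                let data: Vec<f32> =
                    self.data.iter().zip(&other.data).map(|(a, b)| a + b).collect();
                Tensor::new(data, self.shape.clone(), self.device)
            }
        }
    }

    fn matmul(&self, other: &Self) -> Result<Self, String> {
        // 2-D only in this sketch: [m, k] x [k, n] -> [m, n].
        if self.shape.len() != 2 || other.shape.len() != 2 || self.shape[1] != other.shape[0] {
            return Err("matmul: expected [m, k] x [k, n]".to_string());
        }
        let (m, k, n) = (self.shape[0], self.shape[1], other.shape[1]);
        match self.device {
            Device::Cpu => {
                let mut out = vec![0.0f32; m * n];
                for i in 0..m {
                    for p in 0..k {
                        let a = self.data[i * k + p];
                        for j in 0..n {
                            out[i * n + j] += a * other.data[p * n + j];
                        }
                    }
                }
                Tensor::new(out, vec![m, n], self.device)
            }
        }
    }
}
```

Under this shape, a new backend would be another Device variant and another match arm, and code written against TensorOps would not change.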

ModelCore

One Model trait with forward() and generate(). The model_config! macro generates the boilerplate so a new architecture is mostly its forward pass.
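
As an illustration of why one trait is enough, the sketch below gives generate() a default greedy-decoding body, so an architecture only has to supply forward(). The signatures are assumptions, not UniLLM's API: this version works on token ids rather than text, CounterModel is a toy stand-in for a real architecture, and the model_config! macro is not modeled.

```rust
/// Hypothetical trait shape: `forward` returns next-token logits,
/// `generate` is a default greedy loop shared by every architecture.
pub trait Model {
    fn vocab_size(&self) -> usize;

    /// Logits over the vocabulary for the token that follows `tokens`.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;

    fn generate(&self, prompt: &[u32], max_new_tokens: usize) -> Vec<u32> {
        let mut tokens = prompt.to_vec();
        for _ in 0..max_new_tokens {
            let logits = self.forward(&tokens);
            // Greedy decoding: pick the index of the largest logit.
            let next = logits
                .iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i as u32);
            match next {
                Some(t) => tokens.push(t),
                None => break, // empty logits: nothing to sample
            }
        }
        tokens
    }
}

/// Toy "architecture": always predicts (last token + 1) mod vocab.
pub struct CounterModel {
    pub vocab: usize,
}

impl Model for CounterModel {
    fn vocab_size(&self) -> usize {
        self.vocab
    }

    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        if self.vocab == 0 {
            return Vec::new();
        }
        let mut logits = vec![0.0f32; self.vocab];
        let last = tokens.last().copied().unwrap_or(0) as usize;
        logits[(last + 1) % self.vocab] = 1.0;
        logits
    }
}
```

With a vocabulary of 5 and a prompt of [3], four greedy steps extend the sequence to [3, 4, 0, 1, 2].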

WeightLoaderCore

Format-agnostic loader for SafeTensors, GGUF (with Q4_0/Q8_0 dequantization), and PyTorch files, returning a unified ModelWeights container.
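
Format detection of this kind usually comes down to reading the first few bytes of the file. The sketch below shows one way to do it. It is not UniLLM's implementation, and WeightFormat and sniff_format are hypothetical names. It relies on three published facts: GGUF files begin with the ASCII magic "GGUF", zip-based torch.save checkpoints begin with the zip local-file header, and SafeTensors files begin with an 8-byte little-endian header length followed by a JSON object.

```rust
/// Hypothetical format tag for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WeightFormat {
    Gguf,
    SafeTensors,
    PyTorch,
    Unknown,
}

/// Guess the weight format from the first bytes of a file.
/// Legacy (non-zip) PyTorch pickles are not handled here.
pub fn sniff_format(head: &[u8]) -> WeightFormat {
    // GGUF: ASCII magic "GGUF" at offset 0.
    if head.starts_with(b"GGUF") {
        return WeightFormat::Gguf;
    }
    // Zip-based torch.save output: zip local-file header "PK\x03\x04".
    if head.starts_with(b"PK\x03\x04") {
        return WeightFormat::PyTorch;
    }
    // SafeTensors: u64 little-endian header length, then JSON starting with '{'.
    if head.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        let header_len = u64::from_le_bytes(len_bytes);
        if header_len > 0 && head[8] == b'{' {
            return WeightFormat::SafeTensors;
        }
    }
    WeightFormat::Unknown
}
```

A loader built this way would read a small prefix of the file, call the sniffer, and hand the path to the matching parser.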

47

model architectures on one Model trait

3

weight formats: SafeTensors, GGUF, PyTorch

201

passing tests across the workspace

Apache-2.0

open source, no strings

Numbers reflect the current v0.1.0 snapshot. LLaMA is validated end-to-end with real GGUF weights; the remaining architectures are unit-tested and validated as real weights come online.

Quick start

Clone, type-check, and run a generation. TinyLlama (~600 MB) downloads on first run.

git clone https://github.com/cognisoc/unillm.git
cd unillm
cargo check
cargo test --workspace

# Generate text (downloads TinyLlama on first run, ~600 MB)
cargo run --bin unillm -p unillm-runtime -- \
  generate --prompt "Explain gravity"

Inference today runs on CPU. The tensor abstraction is wired for CUDA and Metal via Candle feature flags, but GPU-specific optimization is still on the roadmap. See the ROADMAP for an honest snapshot of what works and what doesn't.

47 model architectures, one trait

Every architecture — from LLaMA to RWKV-6, from Whisper to Mamba — implements the same Model trait. LLaMA is validated end-to-end with real GGUF weights; the other 46 architectures have correct forward-pass implementations covered by unit tests with dummy tensors and are being validated as real weights come online.

Core LLMs

LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral.

GPT family

GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT.

MoE

DeepSeek-MoE, DBRX, Grok, Arctic, Jamba.

Linear attention / State-space

RWKV-4, RWKV-6, RecurrentGemma, Mamba.

Vision-Language

Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, LLaVA, CLIP.

Audio / Speech

Whisper, Wav2Vec2, HuBERT, MusicGen, Encodec.

Performance levers, by design

SIMD kernels

AVX2, AVX-512, and NEON implementations for quantized matmul, RMSNorm, RoPE, and SwiGLU.
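
SIMD kernels like these are typically checked against a plain scalar version. The sketch below gives scalar references for two of the listed ops, using the standard definitions of RMSNorm and SwiGLU. The function names and signatures are assumptions, not UniLLM's kernels.

```rust
/// Scalar RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "rms_norm: length mismatch");
    if x.is_empty() {
        return Vec::new();
    }
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Scalar SwiGLU gate: silu(gate_i) * up_i, with silu(g) = g / (1 + e^(-g)).
pub fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "swiglu: length mismatch");
    gate.iter()
        .zip(up)
        .map(|(g, u)| (g / (1.0 + (-g).exp())) * u)
        .collect()
}
```

A vectorized kernel can then be tested by comparing its output with these functions within a small tolerance.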

Hybrid KV cache

RadixAttention + PagedAttention with an adaptive tiering policy. Wiring into the generation loop is next.
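
The paged half of that design rests on a block table: each sequence maps its token positions onto fixed-size physical blocks, so memory is claimed one block at a time rather than reserved up front. The sketch below shows only that bookkeeping. PagedAllocator and Sequence are hypothetical names, and RadixAttention prefix sharing and the tiering policy are not modeled.

```rust
/// Hypothetical per-sequence state: token count plus its block table.
#[derive(Debug, Default)]
pub struct Sequence {
    pub len: usize,
    pub blocks: Vec<usize>,
}

/// Hypothetical allocator over a fixed pool of equally sized KV blocks.
pub struct PagedAllocator {
    block_size: usize,
    free: Vec<usize>,
}

impl PagedAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        // Reverse so that pop() hands out block 0 first.
        Self { block_size, free: (0..num_blocks).rev().collect() }
    }

    /// Reserve the KV slot for one more token.
    /// Returns (physical block, offset in block), or None if the pool is exhausted.
    pub fn append_token(&mut self, seq: &mut Sequence) -> Option<(usize, usize)> {
        // A new block is needed only when the last one is full (or none exists yet).
        if seq.len % self.block_size == 0 {
            let block = self.free.pop()?;
            seq.blocks.push(block);
        }
        let slot = (seq.blocks[seq.len / self.block_size], seq.len % self.block_size);
        seq.len += 1;
        Some(slot)
    }

    /// Return all of a finished sequence's blocks to the pool.
    pub fn release(&mut self, seq: Sequence) {
        self.free.extend(seq.blocks);
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}
```

With two blocks of two slots each, a sequence can hold four tokens, and the fifth append returns None until something is released.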

Scheduler

Request scheduling with continuous batching, chunked prefill, and admission control.
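
Continuous batching means the batch is re-formed at every decode step: finished requests leave and waiting ones take their slots, without draining the whole batch first. A minimal sketch of that loop follows. Request and Scheduler are hypothetical types, each request is reduced to a count of tokens still to produce, and chunked prefill is left out.

```rust
use std::collections::VecDeque;

/// Hypothetical request: an id and the number of tokens still to generate.
pub struct Request {
    pub id: u32,
    pub remaining: usize,
}

/// Hypothetical scheduler with a fixed number of batch slots.
pub struct Scheduler {
    max_batch: usize,
    waiting: VecDeque<Request>,
    running: Vec<Request>,
}

impl Scheduler {
    pub fn new(max_batch: usize) -> Self {
        Self { max_batch, waiting: VecDeque::new(), running: Vec::new() }
    }

    pub fn submit(&mut self, req: Request) {
        self.waiting.push_back(req);
    }

    /// Run one decode step and return the ids of requests that finished in it.
    pub fn step(&mut self) -> Vec<u32> {
        // Admission control: fill free slots from the queue, oldest first.
        while self.running.len() < self.max_batch {
            match self.waiting.pop_front() {
                Some(req) => self.running.push(req),
                None => break,
            }
        }
        // One token for every running request (stands in for a batched forward pass).
        for req in &mut self.running {
            req.remaining = req.remaining.saturating_sub(1);
        }
        // Retire finished requests so their slots are free on the next step.
        let mut done = Vec::new();
        self.running.retain(|req| {
            if req.remaining == 0 {
                done.push(req.id);
                false
            } else {
                true
            }
        });
        done
    }

    pub fn is_idle(&self) -> bool {
        self.waiting.is_empty() && self.running.is_empty()
    }
}
```

With two slots and three requests needing 1, 3, and 1 tokens, the third request is admitted on the second step, as soon as the first one retires.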

From the blog

2026-05-12
Three cores, one runtime: how UniLLM keeps 47 architectures honest

A walk through TensorCore, ModelCore, and WeightLoaderCore: the three traits that let UniLLM support 47 model families without forking a runtime per architecture.

2026-04-28
RadixAttention plus PagedAttention: the UniLLM KV cache, explained

Why UniLLM's KV cache is hybrid, what RadixAttention and PagedAttention each contribute, and the honest state of integration today.

2026-04-10
Weight loading without the format wars: SafeTensors, GGUF, and PyTorch under one trait

How UniLLM's WeightLoaderCore makes SafeTensors, GGUF, and PyTorch checkpoints interchangeable from a model's point of view, and what dequantization looks like today.


FAQ

Common questions

What is UniLLM?

UniLLM is a modular LLM inference runtime written in Rust. It provides a unified, type-safe interface for running large language models across 47 architectures through three composable abstractions: TensorCore, ModelCore, and WeightLoaderCore.

Does UniLLM run on GPU?

Not yet. Inference runs on CPU today. The tensor abstraction is wired for CUDA and Metal via Candle feature flags, but GPU-specific optimization is on the near-term roadmap.

Which weight formats does it load?

WeightLoaderCore reads SafeTensors, GGUF (with Q4_0 / Q8_0 dequantization to f32 at load time), and PyTorch files, returning a unified ModelWeights container with auto-detection.
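
For readers curious what dequantization to f32 involves, here is a sketch of the per-block arithmetic. It follows the Q8_0 and Q4_0 block layouts used by ggml (32 values per block, one scale per block) rather than UniLLM's own code. The function names are hypothetical, and the scale is passed in as f32 on the assumption that the stored f16 value has already been converted.

```rust
/// Q8_0: 32 signed 8-bit values share one scale `d`; x_i = q_i * d.
pub fn dequantize_q8_0(d: f32, qs: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(qs.iter()) {
        *o = q as f32 * d;
    }
    out
}

/// Q4_0: 32 unsigned 4-bit values packed into 16 bytes, offset by 8.
/// The low nibble of byte j is element j, the high nibble is element j + 16,
/// and x_i = (q_i - 8) * d.
pub fn dequantize_q4_0(d: f32, qs: &[u8; 16]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for j in 0..16 {
        let lo = (qs[j] & 0x0F) as i32 - 8;
        let hi = (qs[j] >> 4) as i32 - 8;
        out[j] = lo as f32 * d;
        out[j + 16] = hi as f32 * d;
    }
    out
}
```

Expanding every block this way at load time trades memory for simplicity: the forward pass only ever sees f32 tensors.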

Are all 47 architectures production-ready?

Not yet. LLaMA is the only architecture validated end-to-end with real GGUF weights. The other 46 architectures have correct forward-pass implementations covered by unit tests with dummy tensors and are being validated as real weights come online.


Put a real inference runtime under your service.

Clone it, type-check it, and generate your first tokens on CPU in a few minutes. Then read the architecture and wire it into your own Rust.
