2026-05-12 · architecture / rust

Three cores, one runtime: how UniLLM keeps 47 architectures honest

Name: unillm
Author: Cognisoc

A walk through TensorCore, ModelCore, and WeightLoaderCore — the three traits that let UniLLM support 47 model families without forking a runtime per architecture.

UniLLM’s README describes it in one line as “a modular LLM inference runtime written in Rust.” The interesting word in that sentence is modular. The runtime supports forty-seven model architectures across ten categories: LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral, GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT, StarCoder, CodeLlama, OLMo, Granite, Yi, Falcon, Baichuan, InternLM, ChatGLM, BERT, T5, Whisper, CLIP, LLaVA, Mamba, MiniCPM, DeepSeek-MoE, DBRX, Grok, Arctic, Jamba, RWKV-4, RWKV-6, RecurrentGemma, Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, Idefics, Florence, Wav2Vec2, HuBERT, MusicGen, and Encodec. It does so without the runtime buckling under the weight of that list. The trick is that almost everything in UniLLM follows from three trait families.

This post walks through those three traits, what they actually contain, and how they keep new architectures cheap to add.

The shape of the problem

Inference runtimes accumulate complexity along three axes:

Hardware. CPU and GPU memory layouts are different; tensor primitives have to dispatch to the right backend. Every operator multiplied by every device is a combinatorial mess if you don’t compress it through a trait.
Architecture. A LLaMA-style decoder, a Mamba state-space block, a vision-language adapter, and a Whisper encoder are different shapes of computation. Done naively, each has its own pipeline.
Weights. Model checkpoints arrive in SafeTensors, GGUF (often quantized), and PyTorch .pt files. Each format has its own tensor name conventions and dtype rules.

UniLLM’s design gives each axis one home and refuses to let the axes leak into each other. TensorCore owns the hardware axis. ModelCore owns the architecture axis. WeightLoaderCore owns the format axis. New work happens in one place.

TensorCore

The first axis collapses into a single Tensor type:

pub struct Tensor {
    data: Arc<dyn TensorStorage>,
    shape: Vec<usize>,
    dtype: DataType,
    device: Device,
}

Arc<dyn TensorStorage> is doing real work here. It means the tensor’s storage is reference-counted and abstract: a CPU buffer, a CUDA allocation, or a Metal allocation can all sit behind the same struct, and cloning a tensor handle is cheap. The shape, dtype, and device are concrete and live next to the data. There is no CudaTensor or CpuTensor — there is one Tensor, and the device is a field.
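To make the cheap-clone claim concrete, here is a minimal sketch of an Arc-backed tensor handle. The TensorStorage method, the CpuBuffer type, the single-variant DataType, and the helper methods are assumptions made for illustration; only the four field names come from the struct above.

```rust
use std::sync::Arc;

// Hypothetical storage trait; the real TensorStorage may expose different methods.
pub trait TensorStorage: Send + Sync {
    fn len_bytes(&self) -> usize;
}

// Illustrative CPU buffer standing in for one backend's storage.
pub struct CpuBuffer(pub Vec<f32>);

impl TensorStorage for CpuBuffer {
    fn len_bytes(&self) -> usize {
        self.0.len() * std::mem::size_of::<f32>()
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DataType {
    F32,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Device {
    CPU,
    CUDA(usize),
    Metal(usize),
}

// Cloning copies the small shape vector and bumps a reference count;
// the underlying buffer is shared, not duplicated.
#[derive(Clone)]
pub struct Tensor {
    data: Arc<dyn TensorStorage>,
    shape: Vec<usize>,
    dtype: DataType,
    device: Device,
}

impl Tensor {
    pub fn from_cpu(values: Vec<f32>, shape: Vec<usize>) -> Tensor {
        Tensor {
            data: Arc::new(CpuBuffer(values)),
            shape,
            dtype: DataType::F32,
            device: Device::CPU,
        }
    }

    pub fn shape(&self) -> &[usize] {
        &self.shape
    }

    pub fn dtype(&self) -> DataType {
        self.dtype
    }

    pub fn device(&self) -> Device {
        self.device
    }

    pub fn len_bytes(&self) -> usize {
        self.data.len_bytes()
    }

    // Number of live handles pointing at the same storage.
    pub fn handle_count(&self) -> usize {
        Arc::strong_count(&self.data)
    }

    // True when both handles point at the same allocation.
    pub fn shares_storage_with(&self, other: &Tensor) -> bool {
        Arc::ptr_eq(&self.data, &other.data)
    }
}
```

Calling clone() on such a handle leaves both copies pointing at one buffer, which is the property the paragraph above relies on.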

Operations live on a TensorOps trait — matmul, attention, layer_norm, rms_norm, embedding, softmax, silu, gelu, reshape, transpose, concat, slice — and on a functional ops_fn module that wraps the trait so model code reads as a sequence of plain function calls instead of method chains. Devices are an enum: CPU, CUDA(usize), Metal(usize), with a Device::auto() that picks the best one available.
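A reduced sketch of that arrangement may help: a trait with two of the listed operations, a CPU implementation, and an ops_fn module of free functions over it. The signatures here work on plain f32 slices and pass the backend explicitly, both of which are simplifications assumed for this example. Device::pick is a hypothetical stand-in for Device::auto() that takes availability flags instead of probing the machine.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Device {
    CPU,
    CUDA(usize),
    Metal(usize),
}

impl Device {
    // Hypothetical stand-in for Device::auto(): availability is passed in
    // rather than detected, so the selection logic stays testable.
    pub fn pick(cuda_available: bool, metal_available: bool) -> Device {
        if cuda_available {
            Device::CUDA(0)
        } else if metal_available {
            Device::Metal(0)
        } else {
            Device::CPU
        }
    }
}

// Two operations only; the trait described above has many more.
pub trait TensorOps {
    fn silu(&self, x: &[f32]) -> Vec<f32>;
    fn softmax(&self, x: &[f32]) -> Vec<f32>;
}

pub struct CpuOps;

impl TensorOps for CpuOps {
    fn silu(&self, x: &[f32]) -> Vec<f32> {
        // silu(v) = v * sigmoid(v) = v / (1 + e^(-v))
        x.iter().map(|v| v / (1.0 + (-v).exp())).collect()
    }

    fn softmax(&self, x: &[f32]) -> Vec<f32> {
        // Subtract the maximum first for numerical stability.
        let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        exps.iter().map(|e| e / sum).collect()
    }
}

// Free-function wrappers so model code reads as a sequence of calls.
pub mod ops_fn {
    use super::TensorOps;

    pub fn silu(ops: &dyn TensorOps, x: &[f32]) -> Vec<f32> {
        ops.silu(x)
    }

    pub fn softmax(ops: &dyn TensorOps, x: &[f32]) -> Vec<f32> {
        ops.softmax(x)
    }
}

// Model-style code: it names operations, never a backend.
pub fn gate_probabilities(ops: &dyn TensorOps, logits: &[f32]) -> Vec<f32> {
    let activated = ops_fn::silu(ops, logits);
    ops_fn::softmax(ops, &activated)
}
```

Under these assumptions, a second backend would be another impl TensorOps, and gate_probabilities would not need to change.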

Today the only fully exercised backend is CPU. The ROADMAP is honest about that: GPU acceleration is on the near-term priority list, and the tensor abstraction is wired for CUDA and Metal via Candle feature flags, but no GPU-specific optimisation has been done. The shape of TensorOps means that work happens behind the trait, not above it — model code does not change when CUDA lands.

What does change in TensorCore is the kernel layer underneath. UniLLM already ships AVX2, AVX-512, and NEON SIMD implementations for the operators that dominate decoder time — quantized matmul, RMSNorm, RoPE, and SwiGLU. Those kernels are an implementation detail of CPU TensorOps, invisible to model code.
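One common way to keep SIMD kernels invisible to callers is runtime feature detection behind a single entry point. The sketch below shows that pattern for RMSNorm. It is not UniLLM’s kernel code: the function names are made up for this example, and the AVX2 path simply reuses the scalar loop under an AVX2-enabled function where a real kernel would use intrinsics.

```rust
// Portable reference kernel: y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
fn rms_norm_scalar(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv * w).collect()
}

// Placeholder for a hand-written AVX2 kernel. It is only compiled on x86_64
// and must only be called after the CPU has reported AVX2 support.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn rms_norm_avx2(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    rms_norm_scalar(x, weight, eps)
}

// The only function callers see. Which kernel runs is decided here.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // Safety: guarded by the runtime AVX2 check above.
            return unsafe { rms_norm_avx2(x, weight, eps) };
        }
    }
    rms_norm_scalar(x, weight, eps)
}
```

Callers use rms_norm and get the same result on every machine; only the code path differs.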

ModelCore

The second axis collapses into a single Model trait:

pub trait Model: Send + Sync {
    type Config: ModelConfig;

    fn new(config: Self::Config) -> Result<Self>;
    fn from_weights(config: Self::Config, weights: ModelWeights) -> Result<Self>;
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs>;
    fn generate(&self, prompt: &str, config: &GenerationConfig) -> Result<String>;
    fn to_device(&mut self, device: &Device) -> Result<()>;
}

The inputs and outputs are enums, not raw tensors. ModelInputs::Text carries input_ids and an optional attention mask. ModelInputs::Image, ::Multimodal, and ::Audio exist for the families that need them. ModelOutputs::Logits carries logits and optional hidden states; ::Embeddings and ::Multimodal cover the other shapes. The type system encodes which adapter you need for which family — a Whisper model and a LLaMA model do not pretend to have the same input.
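As a rough illustration of how enum-typed inputs keep families apart, the sketch below defines cut-down versions of the two enums and a text-only forward function that rejects anything else. The field types (plain vectors instead of tensors), the reduced variant set, and text_only_forward itself are assumptions for this example.

```rust
// Cut-down input enum; payloads are plain vectors here, not tensors.
#[derive(Debug)]
pub enum ModelInputs {
    Text {
        input_ids: Vec<u32>,
        attention_mask: Option<Vec<u8>>,
    },
    Image {
        pixels: Vec<f32>,
    },
    Audio {
        samples: Vec<f32>,
    },
}

#[derive(Debug)]
pub enum ModelOutputs {
    Logits {
        logits: Vec<f32>,
        hidden_states: Option<Vec<f32>>,
    },
    Embeddings(Vec<f32>),
}

// A text decoder accepts only the Text variant and says so explicitly.
// The logits are zero placeholders; the point is the shape of the match.
pub fn text_only_forward(
    inputs: &ModelInputs,
    vocab_size: usize,
) -> Result<ModelOutputs, String> {
    match inputs {
        ModelInputs::Text { input_ids, attention_mask } => {
            if let Some(mask) = attention_mask {
                if mask.len() != input_ids.len() {
                    return Err("attention mask length must match input_ids".to_string());
                }
            }
            Ok(ModelOutputs::Logits {
                logits: vec![0.0; input_ids.len() * vocab_size],
                hidden_states: None,
            })
        }
        _ => Err("this decoder only accepts ModelInputs::Text".to_string()),
    }
}
```

Passing an Audio input to this function produces an error at the boundary instead of a shape mismatch deep inside the forward pass.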

The boilerplate cost of adding an architecture is collapsed by a single macro:

model_config!(MyModelConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    num_hidden_layers: usize = 32,
});

That call generates Default, Clone, Send, Sync, and ModelConfig impls. What’s left for the engineer is the structure of the model and its forward pass. The architecture docs put it bluntly: “All 47 model architectures use identical patterns.” That is the macro speaking.
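For readers curious how such a macro could work, here is a minimal macro_rules! sketch that accepts the same call syntax. It is an assumption about the approach, not UniLLM’s actual macro, and the ModelConfig trait shown (with a single name method) is hypothetical. Send and Sync are auto traits in Rust, so in this sketch they follow from the field types instead of being written out.

```rust
// Hypothetical trait; the bounds make the compiler check that every
// generated config really is Default + Clone + Send + Sync.
pub trait ModelConfig: Default + Clone + Send + Sync {
    fn name() -> &'static str;
}

macro_rules! model_config {
    ($name:ident { $($field:ident : $ty:ty = $default:expr),* $(,)? }) => {
        #[derive(Clone, Debug, PartialEq)]
        pub struct $name {
            $(pub $field: $ty),*
        }

        // Each `field: Type = value` entry becomes a default value.
        impl Default for $name {
            fn default() -> Self {
                $name { $($field: $default),* }
            }
        }

        impl ModelConfig for $name {
            fn name() -> &'static str {
                stringify!($name)
            }
        }
    };
}

model_config!(MyModelConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
    num_hidden_layers: usize = 32,
});
```

With this sketch, MyModelConfig::default() yields the values written in the call, and individual fields can be overridden with struct update syntax.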

A caveat the ROADMAP is careful to name: of the 47 architectures, only LLaMA is validated end-to-end with real GGUF weights. The other 46 have correct forward-pass implementations covered by unit tests with dummy tensors, but most have not been run against real model checkpoints yet. Validating Qwen, Phi, Gemma, Mistral, and DeepSeek with real weights from Ollama is the first item on the near-term priority list.

WeightLoaderCore

The third axis collapses into a WeightLoader:

// Signatures only; function bodies omitted.
impl WeightLoader {
    pub fn from_safetensors<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
    pub fn from_gguf<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
    pub fn from_pytorch<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
    pub fn auto_detect<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
}

Every loader returns the same ModelWeights: a HashMap<String, Tensor> plus some format metadata. Models receive weights through that container and never look at the file format. auto_detect is the entry point most users hit — give it a path, and the loader sniffs the format.
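Format sniffing of this kind is usually done on the first few bytes of the file. The sketch below is one plausible approach, not the loader’s actual logic: GGUF files start with the ASCII magic GGUF, zip-based PyTorch checkpoints start with the zip signature PK\x03\x04, and SafeTensors files start with an 8-byte little-endian header length followed by a JSON object. The WeightFormat enum, sniff_format, and the simplified ModelWeights (flat f32 vectors instead of Tensor values) are names and shapes assumed for this example.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum WeightFormat {
    SafeTensors,
    Gguf,
    PyTorch,
}

// Guess the format from the leading bytes, falling back to the extension.
pub fn sniff_format(file_name: &str, head: &[u8]) -> Option<WeightFormat> {
    if head.starts_with(b"GGUF") {
        return Some(WeightFormat::Gguf);
    }
    if head.starts_with(b"PK\x03\x04") {
        return Some(WeightFormat::PyTorch);
    }
    // Bytes 0..8 hold the header length; the JSON header starts at byte 8.
    if head.len() > 8 && head[8] == b'{' {
        return Some(WeightFormat::SafeTensors);
    }
    match file_name.rsplit('.').next() {
        Some("safetensors") => Some(WeightFormat::SafeTensors),
        Some("gguf") => Some(WeightFormat::Gguf),
        Some("pt") => Some(WeightFormat::PyTorch),
        _ => None,
    }
}

// Simplified container: every loader would fill the same structure,
// so model code never needs to know which format was on disk.
pub struct ModelWeights {
    pub tensors: HashMap<String, Vec<f32>>,
    pub format: WeightFormat,
}

impl ModelWeights {
    pub fn get(&self, name: &str) -> Result<&[f32], String> {
        self.tensors
            .get(name)
            .map(|t| t.as_slice())
            .ok_or_else(|| format!("missing tensor: {name}"))
    }
}
```

Checking content before the extension means a renamed file is still detected, and the extension only decides when the header is missing or unrecognised.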

GGUF support includes dequantization for the common Q4_0 and Q8_0 cases. Today, GGUF weights are dequantized to f32 at load time before any inference happens. Running quantized matmul directly on Q4 / Q8 data — keeping the weights packed in their original layout — is a separate item on the long-term list. It will be a TensorCore change, not a model change.
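For reference, here is a sketch of what load-time dequantization of a single block involves, following the ggml block layouts: 32 weights per block, one scale per block, Q8_0 storing signed bytes and Q4_0 storing 4-bit values offset by 8. To stay self-contained the scale is passed as an f32, whereas the file format stores it as a 16-bit float. The function names are chosen for this example and are not UniLLM’s.

```rust
pub const BLOCK: usize = 32;

// Q8_0: each weight is scale * q, with q a signed 8-bit integer.
pub fn dequantize_q8_0(scale: f32, quants: &[i8; BLOCK]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(quants.iter()) {
        *o = scale * q as f32;
    }
    out
}

// Q4_0: two 4-bit values per byte, each decoded as scale * (nibble - 8).
// The low nibble of byte j is element j; the high nibble is element j + 16.
pub fn dequantize_q4_0(scale: f32, packed: &[u8; BLOCK / 2]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (j, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = scale * lo as f32;
        out[j + BLOCK / 2] = scale * hi as f32;
    }
    out
}
```

The trade-off is visible in the sizes: a Q4_0 block holds 32 weights in 18 bytes on disk (a 2-byte scale plus 16 packed bytes), while the dequantized f32 block occupies 128 bytes in memory.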

What the layers buy

Two concrete things follow from this layout.

Adding a model is a forward pass and a config. You name the config with model_config!, define the struct that holds embeddings, layers, norm, and lm_head, implement Model::forward using ops_fn::matmul, ops_fn::rms_norm, and friends, and you are done. The Cargo workspace already has 47 worked examples to copy from.

Adding a backend is a TensorOps impl. When the CUDA backend lands, it is a new implementation of the TensorOps trait, plus a Device::CUDA(_) dispatch. Models do not move. The scheduler does not move. The KV cache does not move.

That is what UniLLM means by modular: not a config file with a hundred toggles, but three small trait families with one responsibility each.

A worked example: adding a model

The pattern is concrete enough to be worth showing end-to-end. The architecture document lays it out as four steps:

1. Define configuration with model_config!. The macro expands into Default, Clone, Send, Sync, and ModelConfig impls. The fields you write are the architecture-specific ones: vocab size, hidden size, number of layers, number of attention heads, intermediate size, RoPE settings, and so on.
2. Define the model structure. A typical decoder has config, device, embed_tokens, Vec<Layer>, norm, and lm_head, with each Layer holding its own attention and MLP weights.
3. Implement the Model trait. new constructs the model with zero-filled tensors. from_weights(config, weights) pulls named tensors out of the ModelWeights container and assembles the struct. forward runs the architecture-specific computation using ops_fn. generate and to_device come from default implementations or model-specific overrides.
4. Export from models_v2/mod.rs.
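The four steps can be sketched on a deliberately tiny model with no attention layers: an embedding lookup, an RMSNorm, and an output projection. Everything here is simplified and hypothetical. The Model trait is reduced to two methods, weights are flat f32 vectors in a HashMap, the config is written by hand where the macro would normally generate it, and the tensor names are invented for the example.

```rust
use std::collections::HashMap;

// Step 1: configuration (hand-written here instead of using model_config!).
#[derive(Clone, Debug)]
pub struct TinyConfig {
    pub vocab_size: usize,
    pub hidden_size: usize,
}

impl Default for TinyConfig {
    fn default() -> Self {
        TinyConfig { vocab_size: 3, hidden_size: 2 }
    }
}

// Simplified stand-in for ModelWeights: name -> flat row-major data.
pub type Weights = HashMap<String, Vec<f32>>;

// Reduced, hypothetical version of the Model trait.
pub trait Model: Sized {
    type Config;
    fn from_weights(config: Self::Config, weights: &Weights) -> Result<Self, String>;
    fn forward(&self, input_ids: &[usize]) -> Result<Vec<f32>, String>;
}

// Step 2: model structure.
pub struct TinyModel {
    config: TinyConfig,
    embed_tokens: Vec<f32>, // [vocab_size, hidden_size]
    norm: Vec<f32>,         // [hidden_size]
    lm_head: Vec<f32>,      // [vocab_size, hidden_size]
}

// Fetch a named tensor and check its element count.
fn take(weights: &Weights, name: &str, len: usize) -> Result<Vec<f32>, String> {
    let t = weights.get(name).ok_or_else(|| format!("missing tensor {name}"))?;
    if t.len() != len {
        return Err(format!("{name}: expected {len} values, found {}", t.len()));
    }
    Ok(t.clone())
}

// Step 3: implement the trait.
impl Model for TinyModel {
    type Config = TinyConfig;

    fn from_weights(config: TinyConfig, weights: &Weights) -> Result<Self, String> {
        let (v, h) = (config.vocab_size, config.hidden_size);
        Ok(TinyModel {
            embed_tokens: take(weights, "embed_tokens.weight", v * h)?,
            norm: take(weights, "norm.weight", h)?,
            lm_head: take(weights, "lm_head.weight", v * h)?,
            config,
        })
    }

    // Returns vocab_size logits per input token, concatenated.
    fn forward(&self, input_ids: &[usize]) -> Result<Vec<f32>, String> {
        let (v, h) = (self.config.vocab_size, self.config.hidden_size);
        let mut logits: Vec<f32> = Vec::with_capacity(input_ids.len() * v);
        for &id in input_ids {
            if id >= v {
                return Err(format!("token id {id} out of range"));
            }
            let x = &self.embed_tokens[id * h..(id + 1) * h];
            // RMSNorm over the embedding row.
            let mean_sq = x.iter().map(|a| a * a).sum::<f32>() / h as f32;
            let inv = 1.0 / (mean_sq + 1e-6).sqrt();
            let normed: Vec<f32> =
                x.iter().zip(&self.norm).map(|(a, w)| a * inv * w).collect();
            // Output projection: one dot product per vocabulary row.
            for row in self.lm_head.chunks(h) {
                logits.push(row.iter().zip(&normed).map(|(a, b)| a * b).sum::<f32>());
            }
        }
        Ok(logits)
    }
}

// Step 4 would be the export line in models_v2/mod.rs; omitted here
// because this sketch is a single file.
```

Even at this size the division of labour shows: from_weights only names and checks tensors, and forward is the only place that encodes the architecture.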

The only architecture-specific code is the forward pass — which is the part you actually want to read when you’re studying a model. The trait machinery does not get in the way.

Where the layers don’t cross

It’s worth naming the discipline the design enforces, because that’s what keeps the runtime healthy as it grows:

TensorCore does not know what a model is. It knows tensors, devices, and operations.
ModelCore does not know what file format weights came from. It knows the shape of a Model and what ModelInputs / ModelOutputs look like.
WeightLoaderCore does not know which model the weights are for. It knows formats and how to expose tensors by name.

When those boundaries are kept, a change to SIMD kernels does not perturb model code. A change to GGUF parsing does not perturb model code. A new model architecture does not perturb the loader. Forty-seven architectures, three weight formats, and three devices stay a sum, not a product: roughly 47 + 3 + 3 = 53 pieces to maintain instead of 47 × 3 × 3 = 423 combinations. That is the property that lets a small team carry forty-seven model families without losing the thread.

Read the architecture document and the roadmap for the unabridged version — both are part of the repo and kept current with the code.

Source: github.com/cognisoc/unillm · Docs: docs.cognisoc.com/unillm/