2026-04-10 · weights / rust

Weight loading without the format wars: SafeTensors, GGUF, and PyTorch under one trait

Name: unillm
Author: Cognisoc

How UniLLM's WeightLoaderCore makes SafeTensors, GGUF, and PyTorch checkpoints interchangeable from a model's point of view, and what dequantization looks like today.

The first cliff every new inference runtime has to climb isn’t kernels or schedulers; it’s weight loading. SafeTensors, GGUF, and PyTorch .pt checkpoints each have their own tensor name conventions, dtype rules, layout assumptions, and quantization metadata. A runtime that punts on this gets a lot of “could not load weights” issues. UniLLM doesn’t punt — WeightLoaderCore is one of the three named layers of the architecture, sitting beside TensorCore and ModelCore. This post is about what that layer actually does, what it doesn’t yet do, and why models written against UniLLM never have to care which format their weights came in.

The shape of the abstraction

The whole loader is a small surface:

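// Public surface only: method bodies are omitted.
// Result is a one-parameter alias; its error type is not shown here.
pub struct WeightLoader;

impl WeightLoader {
    // One constructor per on-disk format.
    pub fn from_safetensors<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
    pub fn from_gguf<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
    pub fn from_pytorch<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
    // Sniffs the file and dispatches to one of the loaders above.
    pub fn auto_detect<P: AsRef<Path>>(path: P) -> Result<ModelWeights>;
}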

pub struct ModelWeights {
    tensors: HashMap<String, Tensor>,
    metadata: WeightMetadata,
}

ModelWeights is the unified container. A model that implements Model::from_weights(config, weights) reaches into that map by name. It doesn’t know — and shouldn’t know — that the bytes came from a SafeTensors header, a GGUF blob, or a Python pickle.

auto_detect is the path most users hit. It sniffs the file, picks the right loader, and returns the same ModelWeights. That single entry point is what makes the CLI’s unillm generate --model llama2:7b --prompt "Hello" a one-line affair: Ollama hands us a GGUF; the loader recognises it; the model trait picks it up.
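As a rough illustration of what “sniffs the file” can mean, here is a minimal sketch of format detection from a file’s leading bytes. It is not UniLLM’s implementation: WeightFormat and sniff_format are hypothetical names, and the checks rely only on the formats’ published layouts (GGUF’s ASCII magic, the zip header that torch.save writes by default, the pickle PROTO opcode for legacy checkpoints, and SafeTensors’ 8-byte length prefix followed by JSON).

```rust
/// Formats this sketch can tell apart. Hypothetical enum, not UniLLM's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WeightFormat {
    SafeTensors,
    Gguf,
    PyTorch,
}

/// Guess a checkpoint format from the first bytes of the file.
/// This is a heuristic: it inspects magic bytes only and validates nothing else.
pub fn sniff_format(head: &[u8]) -> Option<WeightFormat> {
    // GGUF files begin with the ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return Some(WeightFormat::Gguf);
    }
    // torch.save writes a zip archive by default; zip local headers start "PK\x03\x04".
    if head.starts_with(b"PK\x03\x04") {
        return Some(WeightFormat::PyTorch);
    }
    // SafeTensors: a little-endian u64 header length, then a JSON object.
    // Checked before the bare-pickle case because a header length can start with 0x80.
    if head.len() >= 9 {
        let mut len_bytes = [0u8; 8];
        len_bytes.copy_from_slice(&head[..8]);
        if u64::from_le_bytes(len_bytes) > 0 && head[8] == b'{' {
            return Some(WeightFormat::SafeTensors);
        }
    }
    // Legacy PyTorch checkpoints are a bare pickle stream: PROTO opcode 0x80,
    // followed by the protocol version.
    if head.len() >= 2 && head[0] == 0x80 && (2..=5).contains(&head[1]) {
        return Some(WeightFormat::PyTorch);
    }
    None
}
```

A real loader would read the first few bytes of the path, call something like this, and then hand the file to the matching from_* constructor.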

SafeTensors: the easy case

SafeTensors is the format most modern HuggingFace checkpoints ship in, and it’s the friendliest to a Rust loader. It’s mmap-able, deterministic, dtype-tagged, and explicitly designed to be loaded without arbitrary code execution (the headline contrast with PyTorch pickles). UniLLM’s SafeTensors path reads the header, walks the tensor table, builds a HashMap<String, Tensor> keyed by the format’s own tensor names, and hands the result back as ModelWeights.
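To make “reads the header” concrete, the sketch below splits a SafeTensors buffer into its JSON header and the tensor data that follows. It assumes only the published file layout (an 8-byte little-endian header length, then a UTF-8 JSON header, then raw tensor bytes); split_safetensors is a hypothetical helper, and parsing the JSON itself is left out.

```rust
use std::convert::TryFrom;

/// Split a SafeTensors buffer into (JSON header, raw tensor data).
/// The header's "data_offsets" entries are relative to the start of the data part.
pub fn split_safetensors(buf: &[u8]) -> Result<(&str, &[u8]), String> {
    if buf.len() < 8 {
        return Err("file is shorter than the 8-byte length prefix".to_string());
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&buf[..8]);
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes))
        .map_err(|_| "header length does not fit in usize".to_string())?;
    // Guard against overflow and against a length that runs past the buffer.
    let header_end = 8usize
        .checked_add(header_len)
        .filter(|&end| end <= buf.len())
        .ok_or_else(|| "header length runs past the end of the file".to_string())?;
    let header = std::str::from_utf8(&buf[8..header_end])
        .map_err(|e| format!("header is not valid UTF-8: {}", e))?;
    Ok((header, &buf[header_end..]))
}
```

Because both returned slices borrow from the input, this shape works equally well over an mmap’d region: nothing is copied until a tensor actually needs to be.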

The interesting decision here is layout. SafeTensors gives you raw bytes plus a shape and dtype. UniLLM wraps those bytes in a Tensor whose storage lives behind Arc<dyn TensorStorage>. On CPU, that storage can be the mmap’d region directly; the loader doesn’t have to copy. On GPU backends (which are not yet exercised), the storage becomes a device allocation populated via the host buffer. Either way, the model code sees a Tensor.

GGUF: the format that earns its keep

GGUF is the format that ships from Ollama, llama.cpp, and most of the local-inference ecosystem. It’s where the interesting work happens: GGUF tensors are typically quantized (Q4_0, Q4_K, Q5_0, Q8_0, and several others), with per-block scales and, in some schemes, per-block offsets packed into the file. A runtime that wants to load these has to know each scheme’s bit layout.

UniLLM’s GGUF path supports dequantization for the common cases — Q4_0 and Q8_0 are the ones the README calls out explicitly. The loader walks the GGUF tensor table, recognises the quantization scheme on each tensor, and dequantizes block-by-block into f32 before storing the result in ModelWeights. From that point on, inference is f32 on CPU.
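Block-by-block dequantization is easier to picture with code. The sketch below assumes ggml’s block layouts for the two schemes named above: 32 weights per block, where Q8_0 stores a little-endian f16 scale followed by 32 signed bytes, and Q4_0 stores an f16 scale followed by 16 bytes that each pack two 4-bit values with an implicit offset of 8. The function names are hypothetical and this is not UniLLM’s loader code.

```rust
/// Weights per block in both schemes.
pub const BLOCK_WEIGHTS: usize = 32;
const Q8_0_BLOCK_BYTES: usize = 2 + 32; // f16 scale + 32 x i8
const Q4_0_BLOCK_BYTES: usize = 2 + 16; // f16 scale + 16 bytes of packed nibbles

/// Convert IEEE 754 half-precision bits to f32 (std has no stable f16 type).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x03ff) as u32;
    match exp {
        // Zero and subnormals: frac * 2^-24.
        0 => sign * frac as f32 / 16_777_216.0,
        // Infinity and NaN.
        0x1f => if frac == 0 { sign * f32::INFINITY } else { f32::NAN },
        // Normal numbers: rebias the exponent from 15 to 127.
        _ => sign * f32::from_bits(((exp + 112) << 23) | (frac << 13)),
    }
}

/// Dequantize packed Q8_0 blocks to f32: weight = scale * q.
pub fn dequantize_q8_0(data: &[u8]) -> Option<Vec<f32>> {
    if data.len() % Q8_0_BLOCK_BYTES != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(data.len() / Q8_0_BLOCK_BYTES * BLOCK_WEIGHTS);
    for block in data.chunks_exact(Q8_0_BLOCK_BYTES) {
        let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        out.extend(block[2..].iter().map(|&q| (q as i8) as f32 * d));
    }
    Some(out)
}

/// Dequantize packed Q4_0 blocks to f32: weight = scale * (nibble - 8).
/// Low nibbles fill the first half of the block, high nibbles the second half.
pub fn dequantize_q4_0(data: &[u8]) -> Option<Vec<f32>> {
    if data.len() % Q4_0_BLOCK_BYTES != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(data.len() / Q4_0_BLOCK_BYTES * BLOCK_WEIGHTS);
    for block in data.chunks_exact(Q4_0_BLOCK_BYTES) {
        let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
        let mut vals = [0.0f32; BLOCK_WEIGHTS];
        for (j, &q) in block[2..].iter().enumerate() {
            vals[j] = ((q & 0x0f) as i32 - 8) as f32 * d;
            vals[j + BLOCK_WEIGHTS / 2] = ((q >> 4) as i32 - 8) as f32 * d;
        }
        out.extend_from_slice(&vals);
    }
    Some(out)
}
```

Both functions return a fresh Vec<f32>, which matches the behaviour described above: once loading finishes, nothing downstream needs to know the tensor was ever quantized.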

There are two honest things to say about this.

First, it works. The loader is exercised end-to-end: the CLI downloads a TinyLlama GGUF from Ollama, the loader dequantizes it, the LLaMA model picks it up via from_weights, the tokenizer runs, and the generation loop produces text. That’s the workflow the README’s quick start covers.

Second, it leaves performance on the table. Dequantizing Q4 weights to f32 grows their in-memory footprint roughly sevenfold (about 4.5 bits per weight for Q4_0, counting the per-block scale, becomes 32 bits), and every matmul then has to stream those larger weights through memory. The ROADMAP names the alternative directly: “Direct quantized inference — run Q4_K, Q5_K matmul without dequantization.” That work is mostly a TensorCore change, with only a small WeightLoaderCore change: the loader keeps the bytes in their packed layout, the storage marks the tensor as quantized, and the matmul operator dispatches to a quantized kernel. SIMD-flavoured implementations of those kernels already exist in the runtime (AVX2, AVX-512, and NEON quantized matmul, RMSNorm, RoPE, and SwiGLU all ship in crates/runtime), so the missing piece is the path that carries packed tensors from the loader to them.
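The footprint cost is simple arithmetic, sketched here under two assumptions: ggml’s Q4_0 layout (18 bytes per block of 32 weights) and an illustrative model of 7 billion weights that are all quantized. Real GGUF files usually keep a few tensors unquantized, so treat the numbers as a back-of-the-envelope estimate. The helper names are hypothetical.

```rust
/// Bytes needed to store `n_weights` in a block format that packs
/// `block_weights` values into `block_bytes` bytes (rounded up to whole blocks).
pub fn packed_bytes(n_weights: u64, block_weights: u64, block_bytes: u64) -> u64 {
    let blocks = (n_weights + block_weights - 1) / block_weights;
    blocks * block_bytes
}

/// Bytes needed to store `n_weights` densely at `bytes_per_weight` each
/// (4 for f32, 2 for f16).
pub fn dense_bytes(n_weights: u64, bytes_per_weight: u64) -> u64 {
    n_weights * bytes_per_weight
}

/// How many times larger the dense copy is than the packed one.
pub fn expansion_ratio(n_weights: u64, bytes_per_weight: u64) -> f64 {
    // Q4_0: 32 weights per 18-byte block, i.e. 4.5 bits per weight.
    dense_bytes(n_weights, bytes_per_weight) as f64 / packed_bytes(n_weights, 32, 18) as f64
}
```

For 7 billion weights this gives about 3.9 GB packed against 28 GB as f32 (about 7.1x), or 14 GB as f16 (about 3.6x).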

Until that lands, the trade is clarity over speed: GGUF loads work, every model receives f32 tensors, and the rest of the runtime stays simple.

PyTorch: the format that requires care

PyTorch .pt files are Python pickles. Unpickling one in Python can execute arbitrary code, which is where the “executes arbitrary code” warnings come from; loading one outside Python means parsing the pickle stream by hand, which is error-prone. UniLLM’s PyTorch path reads the pickle stream, extracts the tensor metadata, and converts the underlying storage into the runtime’s Tensor type. It then maps the PyTorch tensor names into the same HashMap<String, Tensor> shape as the other loaders.

This is the path used least in practice — almost every checkpoint that matters in 2026 is published as SafeTensors or GGUF — but it’s the one that gives UniLLM a graceful answer for the long tail of older checkpoints. The loader exists, it returns ModelWeights, and the model trait can’t tell the difference.

Tokenizers and the rest of the file

A GGUF file is more than its tensors. The same file carries the tokenizer vocabulary, the architecture metadata, and the model’s special-token configuration. The README is precise: “GGUF-based tokenizer with byte-level fallback, HuggingFace tokenizer support.” When a GGUF loads, UniLLM extracts the tokenizer state alongside the tensors, so the same file produces both the model and the tokenizer ready to feed it.

For SafeTensors and PyTorch checkpoints, the tokenizer lives in sibling files (HuggingFace’s tokenizer.json, typically), and UniLLM hooks the HuggingFace tokenizer crate. The model layer sees the same Tokenizer trait either way.
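“Byte-level fallback” deserves a quick illustration. The sketch below is a deliberate simplification rather than UniLLM’s tokenizer: it looks up one character at a time instead of running a real merge-based segmentation, and it assumes the SentencePiece-style convention that LLaMA vocabularies use, where each raw byte has a token spelled like <0x0A>. The function name is hypothetical.

```rust
use std::collections::HashMap;

/// Encode text one character at a time; characters missing from the vocabulary
/// are emitted as their UTF-8 bytes using "<0xNN>" byte tokens.
/// Returns None if even a needed byte token is absent.
pub fn encode_with_byte_fallback(text: &str, vocab: &HashMap<String, u32>) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut buf = [0u8; 4];
    for ch in text.chars() {
        let piece: &str = ch.encode_utf8(&mut buf);
        if let Some(&id) = vocab.get(piece) {
            ids.push(id);
            continue;
        }
        // Fallback: one token per UTF-8 byte of the unknown character.
        for b in piece.bytes() {
            let id = vocab.get(&format!("<0x{:02X}>", b))?;
            ids.push(*id);
        }
    }
    Some(ids)
}
```

The practical effect is that no input is ever unencodable as long as the vocabulary carries all 256 byte tokens, which is why the fallback matters for text the vocabulary was never trained on.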

What models actually see

The point of all of this is that a model written against ModelCore doesn’t see any of it. Its from_weights(config, weights) receives a ModelWeights container, looks up embed_tokens, layers.{i}.attn.wq, and the rest by name, and constructs its internal struct. The format is a problem solved one layer down, in WeightLoaderCore. The hardware is a problem solved one layer down from that, in TensorCore.
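In code, “looks up by name and constructs its internal struct” can be as small as the sketch below. Everything here is a stand-in: Tensor is reduced to a shape and an f32 buffer, ModelWeights gets a hypothetical take method, and a bare layer count plays the role of the config argument. Only the tensor names embed_tokens and layers.{i}.attn.wq come from the description above.

```rust
use std::collections::HashMap;

/// Stand-in tensor; UniLLM's real Tensor carries dtype and storage too.
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub shape: Vec<usize>,
    pub data: Vec<f32>,
}

/// Stand-in for the unified container: a name-to-tensor map.
pub struct ModelWeights {
    tensors: HashMap<String, Tensor>,
}

impl ModelWeights {
    pub fn new(tensors: HashMap<String, Tensor>) -> Self {
        ModelWeights { tensors }
    }

    /// Hypothetical accessor: move a tensor out by name, or say which name is missing.
    pub fn take(&mut self, name: &str) -> Result<Tensor, String> {
        self.tensors
            .remove(name)
            .ok_or_else(|| format!("missing tensor: {}", name))
    }
}

#[derive(Debug)]
pub struct Layer {
    pub wq: Tensor,
}

#[derive(Debug)]
pub struct TinyModel {
    pub embed_tokens: Tensor,
    pub layers: Vec<Layer>,
}

impl TinyModel {
    /// Build the model purely from named tensors; no format knowledge is needed here.
    pub fn from_weights(n_layers: usize, mut weights: ModelWeights) -> Result<Self, String> {
        let embed_tokens = weights.take("embed_tokens")?;
        let mut layers = Vec::with_capacity(n_layers);
        for i in 0..n_layers {
            let wq = weights.take(&format!("layers.{}.attn.wq", i))?;
            layers.push(Layer { wq });
        }
        Ok(TinyModel { embed_tokens, layers })
    }
}
```

Note what the constructor never asks: which file the tensors came from, or whether they were quantized on disk.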

When direct quantized inference lands, models still don’t change. When new formats appear, models still don’t change. That’s the whole bargain of the three-layer design. The README names it as a benefit; the architecture document spells it out; and the loader code is where it actually happens.

The loading flow, end to end

It’s worth tracing the path a single GGUF takes through the system, because the layered abstraction makes it easy to forget how much actually happens between “user has a file” and “model is generating text.”

1. The CLI receives generate --model llama2:7b --prompt "Hello". It resolves llama2:7b against the Ollama cache and discovers a local GGUF.
2. WeightLoader::auto_detect inspects the file’s magic bytes, recognises GGUF, and dispatches to the GGUF path.
3. The GGUF parser walks the file header, builds the tensor table, extracts the tokenizer state, and reads architecture metadata (number of layers, hidden size, attention heads, RoPE parameters, etc.).
4. For each tensor, the parser checks the quantization scheme. Q4_0 and Q8_0 blocks are dequantized to f32 into a fresh allocation; non-quantized tensors are read directly. Each becomes a Tensor with the right shape, dtype, and a CPU storage backend.
5. The loader assembles the HashMap<String, Tensor> keyed by GGUF’s tensor names, attaches the metadata (architecture name, source format, dtype info), and returns a ModelWeights.
6. The LLaMA model’s Model::from_weights(config, weights) walks the map by name: token_embd.weight becomes embed_tokens, blk.{i}.attn_q.weight becomes layer i’s wq, and so on. The struct is built.
7. The generation loop tokenises the prompt with the tokenizer the loader extracted, wraps the tokens in a ModelInputs::Text, calls Model::forward, samples a next token with the chosen sampler, and repeats.
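The name translation in the walk above is mechanical enough to sketch. The function below covers only the two mappings this post names (token_embd.weight and blk.{i}.attn_q.weight); gguf_to_internal is a hypothetical helper, and a real table would need an entry for every tensor in the architecture.

```rust
/// Translate a GGUF tensor name into the model's internal name.
/// Covers only the two mappings mentioned in the post; anything else is None.
pub fn gguf_to_internal(name: &str) -> Option<String> {
    if name == "token_embd.weight" {
        return Some("embed_tokens".to_string());
    }
    // "blk.{i}.attn_q.weight" -> "layers.{i}.attn.wq"
    let rest = name.strip_prefix("blk.")?;
    let (index, suffix) = rest.split_once('.')?;
    // Reject names whose layer index is not a number.
    let layer: usize = index.parse().ok()?;
    match suffix {
        "attn_q.weight" => Some(format!("layers.{}.attn.wq", layer)),
        _ => None,
    }
}
```

Returning None for unknown names, rather than passing them through, makes a missing mapping fail loudly at load time instead of surfacing later as a missing-tensor error.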

Every step crosses a clean boundary. The loader doesn’t know which model it’s loading for. The model doesn’t know what format the file was in. The generation loop doesn’t know that the tokenizer was lifted out of the same GGUF that produced the weights. Each layer has one responsibility.

What changes when direct quantized inference lands

The most consequential follow-up to the current loader is direct quantized inference. The f32-dequant-on-load path is correct and simple, but it’s the biggest performance lever the runtime hasn’t yet pulled. The change is small in the loader and bigger in TensorCore: the loader keeps quantized blocks in their packed layout; Tensor storage learns it can be a quantized container; TensorOps::matmul dispatches on the operand’s dtype, with Q4_K and Q5_K operands hitting quantized kernels. Memory pressure drops roughly 7x for Q4 weights relative to today’s f32 copies, and bandwidth on the matmul hot path drops with it. The model code doesn’t change at all.
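Here is a minimal sketch of what “storage learns it can be a quantized container” and “dispatches on the operand’s dtype” could look like. To keep it short it swaps in Q8_0 (ggml layout: an f16 scale plus 32 signed bytes per block) for the Q4_K and Q5_K schemes the roadmap targets, whose layouts are more involved, and it uses a scalar dot product in place of a SIMD matmul. Storage and its methods are hypothetical, not TensorCore’s API.

```rust
/// Convert IEEE 754 half-precision bits to f32 (std has no stable f16 type).
fn f16_to_f32(bits: u16) -> f32 {
    let sign = if bits & 0x8000 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((bits >> 10) & 0x1f) as u32;
    let frac = (bits & 0x03ff) as u32;
    match exp {
        0 => sign * frac as f32 / 16_777_216.0, // zero and subnormals
        0x1f => if frac == 0 { sign * f32::INFINITY } else { f32::NAN },
        _ => sign * f32::from_bits(((exp + 112) << 23) | (frac << 13)),
    }
}

/// Hypothetical storage: plain f32 values, or Q8_0 blocks kept packed.
pub enum Storage {
    F32(Vec<f32>),
    Q8_0(Vec<u8>),
}

impl Storage {
    /// Dot product of the stored row with an f32 activation vector.
    /// The quantized arm never materialises an f32 copy of the weights.
    pub fn dot(&self, x: &[f32]) -> Option<f32> {
        match self {
            Storage::F32(w) => {
                if w.len() != x.len() {
                    return None;
                }
                Some(w.iter().zip(x).map(|(a, b)| a * b).sum())
            }
            Storage::Q8_0(bytes) => {
                // 34 bytes per block of 32 weights.
                if bytes.len() % 34 != 0 || bytes.len() / 34 * 32 != x.len() {
                    return None;
                }
                let mut acc = 0.0f32;
                for (block, xs) in bytes.chunks_exact(34).zip(x.chunks_exact(32)) {
                    let d = f16_to_f32(u16::from_le_bytes([block[0], block[1]]));
                    let mut partial = 0.0f32;
                    for (&q, &xv) in block[2..].iter().zip(xs) {
                        partial += (q as i8) as f32 * xv;
                    }
                    // Apply the block scale once per block rather than per weight.
                    acc += d * partial;
                }
                Some(acc)
            }
        }
    }
}
```

The caller’s code is identical for both variants, which is the property that lets model code stay untouched when the packed path arrives.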

That bargain — a small loader change opens a large perf budget — is only possible because the three responsibilities were kept separate. For the full list of formats and the live state of every subsystem, see the ROADMAP and ARCHITECTURE docs.

Source: github.com/cognisoc/unillm · Docs: docs.cognisoc.com/unillm/